PREFACE


In our present world of automation, cloud computing, algorithms, artificial intelligence, 
and big data, few topics are as relevant as data science and machine learning. Their recent 
popularity lies not only in their applicability to real-life questions, but also in their natural 
blending of many different disciplines, including mathematics, statistics, computer science, 
engineering, science, and finance. 

To someone starting to learn these topics, the multitude of computational techniques 
and mathematical ideas may seem overwhelming. Some may be satisfied with only learn- 
ing how to use off-the-shelf recipes to apply to practical situations. But what if the assump- 
tions of the black-box recipe are violated? Can we still trust the results? How should the 
algorithm be adapted? To be able to truly understand data science and machine learning it 
is important to appreciate the underlying mathematics and statistics, as well as the resulting 
algorithms. 

The purpose of this book is to provide an accessible, yet comprehensive, account of 
data science and machine learning. It is intended for anyone interested in gaining a better 
understanding of the mathematics and statistics that underpin the rich variety of ideas and 
machine learning algorithms in data science. Our viewpoint is that computer languages 
come and go, but the underlying key ideas and algorithms will remain forever and will 
form the basis for future developments. 

Before we turn to a description of the topics in this book, we would like to say a 
few words about its philosophy. This book resulted from various courses in data science 
and machine learning at the Universities of Queensland and New South Wales, Australia. 
When we taught these courses, we noticed that students were eager to learn not only how 
to apply algorithms but also to understand how these algorithms actually work. However, 
many existing textbooks assumed either too much background knowledge (e.g., measure 
theory and functional analysis) or too little (everything is a black box), and the information 
overload from often disjointed and contradictory internet sources made it more difficult for 
students to gradually build up their knowledge and understanding. We therefore wanted to 
write a book about data science and machine learning that can be read as a linear story, 
with a substantial “backstory” in the appendices. The main narrative starts very simply and 
builds up gradually to quite an advanced level. The backstory contains all the necessary 
background, as well as additional information, from linear algebra and functional analysis
(Appendix A), multivariate differentiation and optimization (Appendix B), and probability 
and statistics (Appendix C). Moreover, to make the abstract ideas come alive, we believe 
it is important that the reader sees actual implementations of the algorithms, directly trans- 
lated from the theory. After some deliberation we have chosen Python as our programming 
language. It is freely available and has been adopted as the programming language of 
choice for many practitioners in data science and machine learning. It has many useful 
packages for data manipulation (often ported from R) and has been designed to be easy to 
program. A gentle introduction to Python is given in Appendix D. 


To keep the book manageable in size we had to be selective in our choice of topics. 
Important ideas and connections between various concepts are highlighted via keywords 
and page references (indicated by a ☞) in the margin. Key definitions and theorems are
highlighted in boxes. Whenever feasible we provide proofs of theorems. Finally, we place 
great importance on notation. It is often the case that once a consistent and concise system 
of notation is in place, seemingly difficult ideas suddenly become obvious. We use differ- 
ent fonts to distinguish between different types of objects. Vectors are denoted by letters in 
boldface italics, x, X, and matrices by uppercase letters in boldface roman font, A, K. We 
also distinguish between random vectors and their values by using upper and lower case 
letters, e.g., X (random vector) and x (its value or outcome). Sets are usually denoted by 
calligraphic letters G, H. The symbols for probability and expectation are P and E, respect- 
ively. Distributions are indicated by sans serif font, as in Bin and Gamma; exceptions are 
the ubiquitous notations N and U for the normal and uniform distributions. A summary of 
the most important symbols and abbreviations is given on Pages xvii–xxi.


Data science provides the language and techniques necessary for understanding and 
dealing with data. It involves the design, collection, analysis, and interpretation of nu- 
merical data, with the aim of extracting patterns and other useful information. Machine 
learning, which is closely related to data science, deals with the design of algorithms and 
computer resources to learn from data. The organization of the book follows roughly the 
typical steps in a data science project: Gathering data to gain information about a research 
question; cleaning, summarization, and visualization of the data; modeling and analysis of 
the data; translating decisions about the model into decisions and predictions about the re- 
search question. As this is a mathematics and statistics oriented book, most emphasis will 
be on modeling and analysis. 


We start in Chapter 1 with the reading, structuring, summarization, and visualization 
of data using the data manipulation package pandas in Python. Although the material 
covered in this chapter requires no mathematical knowledge, it forms an obvious starting 
point for data science: to better understand the nature of the available data. In Chapter 2, we 
introduce the main ingredients of statistical learning. We distinguish between supervised 
and unsupervised learning techniques, and discuss how we can assess the predictive per- 
formance of (un)supervised learning methods. An important part of statistical learning is 
the modeling of data. We introduce various useful models in data science including linear, 
multivariate Gaussian, and Bayesian models. Many algorithms in machine learning and 
data science make use of Monte Carlo techniques, which is the topic of Chapter 3. Monte 
Carlo can be used for simulation, estimation, and optimization. Chapter 4 is concerned 
with unsupervised learning, where we discuss techniques such as density estimation, clus- 
tering, and principal component analysis. We then turn our attention to supervised learning 
in Chapter 5, and explain the ideas behind a broad class of regression models. Therein, we
also describe how Python’s statsmodels package can be used to define and analyze linear 
models. Chapter 6 builds upon the previous regression chapter by developing the power- 
ful concepts of kernel methods and regularization, which allow the fundamental ideas of 
Chapter 5 to be expanded in an elegant way, using the theory of reproducing kernel Hilbert 
spaces. In Chapter 7, we proceed with the classification task, which also belongs to the 
supervised learning framework, and consider various methods for classification, including 
Bayes classification, linear and quadratic discriminant analysis, K-nearest neighbors, and 
support vector machines. In Chapter 8 we consider versatile methods for regression and 
classification that make use of tree structures. Finally, in Chapter 9, we consider the work- 
ings of neural networks and deep learning, and show that these learning algorithms have a 
simple mathematical interpretation. An extensive range of exercises is provided at the end 
of each chapter. 


Python code and data sets for each chapter can be downloaded from the GitHub site: 


https://github.com/DSML-book
NOTATION





We could, of course, use any notation we want; do not laugh at notations; 
invent them, they are powerful. In fact, mathematics is, to a large extent, in- 
vention of better notations. 


Richard P. Feynman 


We have tried to use a notation system that is, in order of importance, simple, descript- 
ive, consistent, and compatible with historical choices. Achieving all of these goals all of 
the time would be impossible, but we hope that our notation helps to quickly recognize 
the type or “flavor” of certain mathematical objects (vectors, matrices, random vectors, 
probability measures, etc.) and clarify intricate ideas. 

We make use of various typographical aids, and it will be beneficial for the reader to 
be aware of some of these. 


• Boldface font is used to indicate composite objects, such as column vectors $\boldsymbol{x} = [x_1,\ldots,x_n]^\top$
and matrices $\mathbf{X} = [x_{ij}]$. Note also the difference between the upright bold
font for matrices and the slanted bold font for vectors.


• Random variables are generally specified with upper case roman letters $X, Y, Z$ and their
outcomes with lower case letters $x, y, z$. Random vectors are thus denoted in upper case
slanted bold font: $\boldsymbol{X} = [X_1,\ldots,X_n]$.


• Sets of vectors are generally written in calligraphic font, such as $\mathcal{X}$, but the set of real
numbers uses the common blackboard bold font $\mathbb{R}$. Expectation and probability also use
the latter font.


• Probability distributions use a sans serif font, such as Bin and Gamma. Exceptions to
this rule are the “standard” notations $\mathcal{N}$ and $\mathcal{U}$ for the normal and uniform distributions.


• We often omit brackets when it is clear what the argument is of a function or operator.
For example, we prefer $\mathbb{E}X^2$ to $\mathbb{E}[X^2]$.


• We employ color to emphasize that certain words refer to a dataset, function, or
package in Python. All code is written in typewriter font. To be compatible with past
notation choices, we introduced a special blue symbol $\mathbf{X}$ for the model (design) matrix of
a linear model.


• Important notation such as $\mathcal{T}$, $g$, $g^*$ is often defined in a mnemonic way, such as $\mathcal{T}$ for
“training”, $g$ for “guess”, $g^*$ for the “star” (that is, optimal) guess, and $\ell$ for “loss”.


• We will occasionally use a Bayesian notation convention in which the same symbol is
used to denote different (conditional) probability densities. In particular, instead of writing
$f_X(x)$ and $f_{X|Y}(x\,|\,y)$ for the probability density function (pdf) of $X$ and the conditional pdf
of $X$ given $Y$, we simply write $f(x)$ and $f(x\,|\,y)$. This particular style of notation can be of
great descriptive value, despite its apparent ambiguity.


General font/notation rules 


$x$                  scalar
$\boldsymbol{x}$     vector
$\boldsymbol{X}$     random vector
$\mathbf{X}$         matrix
$\mathcal{X}$        set
$\widehat{x}$        estimate or approximation
$x^*$                optimal
$\overline{x}$       average

Common mathematical symbols 


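$\forall$                                  for all
$\exists$                                  there exists
$\propto$                                  is proportional to
$\perp$                                    is perpendicular to
$\sim$                                     is distributed as
$\overset{\text{iid}}{\sim}$, $\sim_{\text{iid}}$   are independent and identically distributed as
$\overset{\text{approx.}}{\sim}$           is approximately distributed as
$\nabla f$                                 gradient of $f$
$\nabla^2 f$                               Hessian of $f$
$f \in C^p$                                $f$ has continuous derivatives of order $p$
$\approx$                                  is approximately
$\simeq$                                   is asymptotically
$\ll$                                      is much smaller than
$\oplus$                                   direct sum
$\odot$                                    elementwise product
$\cap$                                     intersection
$\cup$                                     union
$:=$                                       is defined as
$\overset{\text{a.s.}}{\longrightarrow}$   converges almost surely to
$\overset{d}{\longrightarrow}$             converges in distribution to
$\overset{\mathbb{P}}{\longrightarrow}$    converges in probability to
$\overset{L_p}{\longrightarrow}$           converges in $L_p$-norm to
$\|\cdot\|$                                Euclidean norm
$\lceil x \rceil$                          smallest integer larger than $x$
$\lfloor x \rfloor$                        largest integer smaller than $x$
$x_+$                                      $\max\{x, 0\}$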


Matrix/vector notation 


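$\mathbf{A}^\top$, $\boldsymbol{x}^\top$   transpose of matrix $\mathbf{A}$ or vector $\boldsymbol{x}$
$\mathbf{A}^{-1}$                          inverse of matrix $\mathbf{A}$
$\mathbf{A}^{+}$                           pseudo-inverse of matrix $\mathbf{A}$
$\mathbf{A}^{-\top}$                       inverse of matrix $\mathbf{A}^\top$ or transpose of $\mathbf{A}^{-1}$
$\mathbf{A} \succ 0$                       matrix $\mathbf{A}$ is positive definite
$\mathbf{A} \succeq 0$                     matrix $\mathbf{A}$ is positive semidefinite
$\dim(\boldsymbol{x})$                     dimension of vector $\boldsymbol{x}$
$\det(\mathbf{A})$                         determinant of matrix $\mathbf{A}$
$|\mathbf{A}|$                             absolute value of the determinant of matrix $\mathbf{A}$
$\mathrm{tr}(\mathbf{A})$                  trace of matrix $\mathbf{A}$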


Reserved letters and words 


$\mathbb{C}$       set of complex numbers
d                  differential symbol
$\mathbb{E}$       expectation
e                  the number 2.71828...
$f$                probability density (discrete or continuous)
$g$                prediction function
$\mathbb{1}\{A\}$ or $\mathbb{1}_A$   indicator function of set $A$
i                  the square root of $-1$
$\ell$             risk: expected loss
Loss               loss function
ln                 (natural) logarithm
$\mathbb{N}$       set of natural numbers $\{0, 1, \ldots\}$
$O$                big-O order symbol: $f(x) = O(g(x))$ if $|f(x)| \leq \alpha\, g(x)$ for some constant $\alpha$ as $x \to a$
$o$                little-o order symbol: $f(x) = o(g(x))$ if $f(x)/g(x) \to 0$ as $x \to a$
$\mathbb{P}$       probability measure
$\pi$              the number 3.14159...
$\mathbb{R}$       set of real numbers (one-dimensional Euclidean space)
$\mathbb{R}^n$     $n$-dimensional Euclidean space
$\mathbb{R}_+$     positive real line: $[0, \infty)$
$\tau$             deterministic training set
$\mathcal{T}$      random training set
$\mathbf{X}$       model (design) matrix
$\mathbb{Z}$       set of integers $\{\ldots, -1, 0, 1, \ldots\}$


Probability distributions 


Ber Bernoulli 
Beta beta 

Bin binomial 
Exp exponential 
CHAPTER 1





IMPORTING, SUMMARIZING, AND 
VISUALIZING DATA 





This chapter describes where to find useful data sets, how to load them into Python, 
and how to (re)structure the data. We also discuss various ways in which the data can 
be summarized via tables and figures. Which type of plots and numerical summaries 
are appropriate depends on the type of the variable(s) in play. Readers unfamiliar with 
Python are advised to read Appendix D first. 


1.1 Introduction 


Data comes in many shapes and forms, but can generally be thought of as being the result 
of some random experiment — an experiment whose outcome cannot be determined in 
advance, but whose workings are still subject to analysis. Data from a random experiment 
are often stored in a table or spreadsheet. A statistical convention is to denote variables — 
often called features — as columns and the individual items (or units) as rows. It is useful 
to think of three types of columns in such a spreadsheet: 


1. The first column is usually an identifier or index column, where each unit/row is 
given a unique name or ID. 


2. Certain columns (features) can correspond to the design of the experiment, specify- 
ing, for example, to which experimental group the unit belongs. Often the entries in 
these columns are deterministic; that is, they stay the same if the experiment were to 
be repeated. 


3. Other columns represent the observed measurements of the experiment. Usually, 
these measurements exhibit variability; that is, they would change if the experiment 
were to be repeated. 


There are many data sets available from the Internet and in software packages. A well- 
known repository of data sets is the Machine Learning Repository maintained by the Uni- 
versity of California at Irvine (UCI), found at https://archive.ics.uci.edu/. 





These data sets are typically stored in a CSV (comma separated values) format, which
can be easily read into Python. For example, to access the abalone data set from this website
with Python, download the file to your working directory, import the pandas package via


import pandas as pd


and read in the data as follows: 


abalone = pd.read_csv('abalone.data',header = None) 


It is important to add header = None, as this lets pandas know that the first line of the
CSV file does not contain the names of the features; by default, read_csv assumes that it does. The data set
was originally used to predict the age of abalone from physical measurements, such as 
shell weight and diameter. 
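The effect of header = None can be sketched on a small in-memory CSV. The two records below are made up for illustration (they are not taken from the abalone file), and the names raw, with_default, and no_header are ours.

```python
import io
import pandas as pd

# Two made-up records in the style of a headerless CSV file (illustrative values only).
raw = "M,0.45,0.36,15\nF,0.53,0.42,9\n"

# Default behaviour: the first line is consumed as the header row.
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is treated as data; columns are labelled 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)        # the first record has been used up as column names
print(no_header.shape)           # both records are kept as data
print(list(no_header.columns))   # integer column labels
```

Without header = None the first record silently disappears from the data, which is easy to overlook in a file with thousands of rows.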

Another useful repository of over 1000 data sets from various packages in the R pro- 
gramming language, collected by Vincent Arel-Bundock, can be found at: 


https://vincentarelbundock.github.io/Rdatasets/datasets.html. 


For example, to read Fisher’s famous iris data set from R’s datasets package into Py- 
thon, type: 


urlprefix = 'https://vincentarelbundock.github.io/Rdatasets/csv/' 


dataname = 'datasets/iris.csv' 
iris = pd.read_csv(urlprefix + dataname) 





The iris data set contains four physical measurements (sepal/petal length/width) on 
50 specimens (each) of 3 species of iris: setosa, versicolor, and virginica. Note that in this 
case the headers are included. The output of read_csv is a DataFrame object, which is 
pandas’s implementation of a spreadsheet; see Section D.12.1. The DataFrame method 
head gives the first few rows of the DataFrame, including the feature names. The number 
of rows can be passed as an argument and is 5 by default. For the iris DataFrame, we 
have: 


iris.head() 


   Unnamed: 0  Sepal.Length  ...  Petal.Width Species
0           1           5.1  ...          0.2  setosa
1           2           4.9  ...          0.2  setosa
2           3           4.7  ...          0.2  setosa
3           4           4.6  ...          0.2  setosa
4           5           5.0  ...          0.2  setosa

[5 rows x 6 columns]


The names of the features can be obtained via the columns attribute of the DataFrame 
object, as in iris.columns. Note that the first column is a duplicate index column, whose 
name (assigned by pandas) is 'Unnamed: 0'. We can drop this column and reassign the 
iris object as follows: 


iris = iris.drop(columns='Unnamed: 0')  # drop the duplicate index column
The data for each feature (corresponding to its specific name) can be accessed by using
Python’s slicing notation []. For example, the object iris['Sepal.Length'] contains
the 150 sepal lengths.

The first three rows of the abalone data set from the UCI repository can be found as 
follows: 


abalone.head(3)


   0      1      2      3       4       5       6      7   8
0  M  0.455  0.365  0.095  0.5140  0.2245  0.1010  0.150  15
1  M  0.350  0.265  0.090  0.2255  0.0995  0.0485  0.070   7
2  F  0.530  0.420  0.135  0.6770  0.2565  0.1415  0.210   9





Here, the missing headers have been assigned according to the order of the natural 
numbers. The names should correspond to Sex, Length, Diameter, Height, Whole weight, 
Shucked weight, Viscera weight, Shell weight, and Rings, as described in the file with the 
name abalone.names on the UCI website. We can manually add the names of the features
to the DataFrame by reassigning the columns attribute, as in: 


abalone.columns = ['Sex', 'Length', 'Diameter', 'Height',
                   'Whole weight', 'Shucked weight', 'Viscera weight',
                   'Shell weight', 'Rings']





1.2 Structuring Features According to Type 


We can generally classify features as either quantitative or qualitative. Quantitative features 
possess “numerical quantity”, such as height, age, number of births, etc., and can either be 
continuous or discrete. Continuous quantitative features take values in a continuous range 
of possible values, such as height, voltage, or crop yield; such features capture the idea 
that measurements can always be made more precisely. Discrete quantitative features have 
a countable number of possibilities, such as a count. 

In contrast, qualitative features do not have a numerical meaning, but their possible 
values can be divided into a fixed number of categories, such as {M,F} for gender or {blue, 
black, brown, green} for eye color. For this reason such features are also called categorical. 
A simple rule of thumb is: if it does not make sense to average the data, it is categorical. 
For example, it does not make sense to average eye colors. Of course it is still possible to 
represent categorical data with numbers, such as 1 = blue, 2 = black, 3 = brown, but such 
numbers carry no quantitative meaning. Categorical features are often called factors. 

When manipulating, summarizing, and displaying data, it is important to correctly spe- 
cify the type of the variables (features). We illustrate this using the nutrition_elderly 
data set from [73], which contains the results of a study involving nutritional measure- 
ments of thirteen features (columns) for 226 elderly individuals (rows). The data set can be 
obtained from: 


http://www.biostatisticien.eu/springeR/nutrition_elderly.xls. 


Excel files can be read directly into pandas via the read_excel method: 


xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)





This creates a DataFrame object nutri. The first three rows are as follows: 


pd.set_option('display.max_columns', 8)  # to fit display
nutri.head(3)


   gender  situation  tea  ...  cooked_fruit_veg  chocol  fat
0       2          1    0  ...                 4       5    6
1       2          1    1  ...                 5       1    4
2       2          1    0  ...                 2       5    4

[3 rows x 13 columns]





You can check the type (or structure) of the variables via the info method of nutri.


nutri.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns): 
gender 226 non-null int64 
situation 226 non-null int64 
tea 226 non-null int64 
coffee 226 non-null int64 
height 226 non-null int64 
weight 226 non-null int64 
age 226 non-null int64 
meat 226 non-null int64 
fish 226 non-null int64 
raw_fruit 226 non-null int64 
cooked_fruit_veg 226 non-null int64 
chocol 226 non-null int64 
fat 226 non-null int64 
dtypes: int64(13) 
memory usage: 23.0 KB 





All 13 features in nutri are (at the moment) interpreted by Python as quantitative 
variables, indeed as integers, simply because they have been entered as whole numbers. 
The meaning of these numbers becomes clear when we consider the description of the 
features, given in Table 1.2. Table 1.1 shows how the variable types should be classified. 


Table 1.1: The feature types for the data frame nutri. 





Qualitative             | gender, situation, fat
                        | meat, fish, raw_fruit, cooked_fruit_veg, chocol
Discrete quantitative   | tea, coffee
Continuous quantitative | height, weight, age








Note that the categories of the qualitative features in the second row of Table 1.1, meat,
..., chocol, have a natural order. Such qualitative features are sometimes called ordinal, in
contrast to qualitative features without order, which are called nominal. We will not make
such a distinction in this book.


Table 1.2: Description of the variables in the nutritional study [73].

Feature           Description                    Unit or Coding
gender            Gender                         1=Male; 2=Female
situation         Family status                  1=Single
                                                 2=Living with spouse
                                                 3=Living with family
                                                 4=Living with someone else
tea               Daily consumption of tea       Number of cups
coffee            Daily consumption of coffee    Number of cups
height            Height                         cm
weight            Weight (actually: mass)        kg
age               Age at date of interview       Years
meat              Consumption of meat            0=Never
                                                 1=Less than once a week
                                                 2=Once a week
                                                 3=2–3 times a week
                                                 4=4–6 times a week
                                                 5=Every day
fish              Consumption of fish            As in meat
raw_fruit         Consumption of raw fruits      As in meat
cooked_fruit_veg  Consumption of cooked          As in meat
                  fruits and vegetables
chocol            Consumption of chocolate       As in meat
fat               Type of fat used               1=Butter
                  for cooking                    2=Margarine
                                                 3=Peanut oil
                                                 4=Sunflower oil
                                                 5=Olive oil
                                                 6=Mix of vegetable oils (e.g., Isio4)
                                                 7=Colza oil
                                                 8=Duck or goose fat

We can modify the Python value and type for each categorical feature, using the 
replace and astype methods. For categorical features, such as gender, we can replace 
the value 1 with 'Male' and 2 with 'Female', and change the type to 'category' as 
follows. 


DICT = {1:'Male', 2:'Female'}  # dictionary specifies replacement
nutri['gender'] = nutri['gender'].replace(DICT).astype('category')





The structure of the other categorical-type features can be changed in a similar way. 
Continuous features such as height should have type float: 


nutri['height'] = nutri['height'].astype(float) 
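As a minimal sketch of how the same pattern extends to several columns, the following uses a hypothetical three-row frame called toy with made-up values, so that it runs without downloading the data. The label dictionaries follow the coding of Table 1.2, but the short label strings themselves are our own choice.

```python
import pandas as pd

# Hypothetical miniature stand-in for the nutri data (values are made up).
toy = pd.DataFrame({'gender': [2, 1, 2],
                    'fat': [6, 4, 1],
                    'height': [151, 162, 170]})

# Dictionaries mapping integer codes to labels (only the codes used here).
gender_labels = {1: 'Male', 2: 'Female'}
fat_labels = {1: 'butter', 4: 'sunflower', 6: 'Isio4'}

# Replace the codes by labels and mark the columns as categorical.
toy['gender'] = toy['gender'].replace(gender_labels).astype('category')
toy['fat'] = toy['fat'].replace(fat_labels).astype('category')

# Continuous features are stored as floats.
toy['height'] = toy['height'].astype(float)

print(toy.dtypes)
```

After the conversion, methods such as describe treat gender and fat as qualitative and report counts rather than means.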







We can repeat this for the other variables (see Exercise 2) and save this modified data 
frame as a CSV file, by using the pandas method to_csv. 


nutri.to_csv('nutri.csv',index=False) 


1.3 Summary Tables 


It is often useful to summarize a large spreadsheet of data in a more condensed form. A 
table of counts or a table of frequencies makes it easier to gain insight into the underlying 
distribution of a variable, especially if the data are qualitative. Such tables can be obtained 
with the methods describe and value_counts. 

As a first example, we load the nutri DataFrame, which we restructured and saved
(see previous section) as 'nutri.csv', and then construct a summary for the feature
(column) 'fat'.


nutri = pd.read_csv('nutri.csv') 
nutri['fat'].describe() 


count 226 
unique 8 
top sunflower 
freq 68 
Name: fat, dtype: object 





We see that there are 8 different types of fat used and that sunflower has the highest 
count, with 68 out of 226 individuals using this type of cooking fat. The method 
value_counts gives the counts for the different fat types. 


nutri['fat'].value_counts() 


sunflower    68
peanut       48
olive        40
margarine    27
Isio4        23
butter       15
duck          4
colza         1
Name: fat, dtype: int64


Column labels are also attributes of a DataFrame, and nutri.fat, for example, is
exactly the same object as nutri['fat'].
It is also possible to use crosstab to cross tabulate between two or more variables,
giving a contingency table:


pd.crosstab(nutri.gender, nutri.situation) 


situation  Couple  Family  Single
gender
Female         56       7      78
Male           63       2      20





We see, for example, that the proportion of single men is substantially smaller than the 
proportion of single women in the data set of elderly people. To add row and column totals 
to a table, use margins=True. 


pd.crosstab(nutri.gender, nutri.situation, margins=True) 


situation  Couple  Family  Single  All
gender
Female         56       7      78  141
Male           63       2      20   85
All           119       9      98  226
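The comparison of proportions mentioned above can be made explicit with the normalize argument of crosstab. The sketch below rebuilds a frame with the same counts as the contingency table, so that it runs without the original file; the names counts, rows, df, and props are ours.

```python
import pandas as pd

# Rebuild a data frame with the same counts as the contingency table.
counts = {('Female', 'Couple'): 56, ('Female', 'Family'): 7, ('Female', 'Single'): 78,
          ('Male', 'Couple'): 63, ('Male', 'Family'): 2, ('Male', 'Single'): 20}
rows = [pair for pair, k in counts.items() for _ in range(k)]
df = pd.DataFrame(rows, columns=['gender', 'situation'])

# normalize='index' divides each row by its row total,
# giving the distribution of situation within each gender.
props = pd.crosstab(df.gender, df.situation, normalize='index')
print(props.round(3))
```

Each row of props sums to one, so the entries in the Single column can be compared directly between the two genders.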





1.4 Summary Statistics 


In the following, $\boldsymbol{x} = [x_1,\ldots,x_n]^\top$ is a column vector of $n$ numbers. For our nutri data,
the vector $\boldsymbol{x}$ could, for example, correspond to the heights of the $n = 226$ individuals.


The sample mean of $\boldsymbol{x}$, denoted by $\bar{x}$, is simply the average of the data values:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.$$


Using the mean method in Python for the nutri data, we have, for instance: 


nutri['height'].mean() 


163.96017699115043





The p-sample quantile ($0 < p < 1$) of $\boldsymbol{x}$ is a value $x$ such that at least a fraction $p$ of the
data is less than or equal to $x$ and at least a fraction $1 - p$ of the data is greater than or equal
to $x$. The sample median is the sample 0.5-quantile. The p-sample quantile is also called
the $100 \times p$ percentile. The 25, 50, and 75 sample percentiles are called the first, second,
and third quartiles of the data. For the nutri data they are obtained as follows.


nutri['height'].quantile(q=[0.25,0.5,0.75])

0.25    157.0
0.50    163.0
0.75    170.0
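The two conditions in the definition of the p-sample quantile can be checked directly. The following is a small sketch on made-up heights; the helper satisfies_quantile_definition is a hypothetical name introduced here and is not part of numpy or pandas.

```python
import numpy as np

def satisfies_quantile_definition(x, value, p):
    """Check the two fraction conditions in the definition of a p-sample quantile."""
    x = np.asarray(x, dtype=float)
    frac_le = np.mean(x <= value)   # fraction of the data <= value
    frac_ge = np.mean(x >= value)   # fraction of the data >= value
    return bool(frac_le >= p and frac_ge >= 1 - p)

heights = np.array([150.0, 157.0, 160.0, 163.0, 170.0, 172.0, 188.0])  # made-up values
median = np.quantile(heights, 0.5)

print(median)
print(satisfies_quantile_definition(heights, median, 0.5))
print(satisfies_quantile_definition(heights, 150.0, 0.5))
```

Note that the quantile routines in numpy and pandas interpolate between neighbouring data values by default, which is one of several common conventions, so for other values of p the returned number need not be one of the observations.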







The sample mean and median give information about the location of the data, while the
distance between sample quantiles (say the 0.1 and 0.9 quantiles) gives some indication of
the dispersion (spread) of the data. Other measures for dispersion are the sample range,
$\max_i x_i - \min_i x_i$, the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, \tag{1.1}$$

and the sample standard deviation $s = \sqrt{s^2}$. For the nutri data, the range (in cm) is:


nutri['height'].max() - nutri['height'].min() 





The variance (in cm²) is:


round(nutri['height'].var(), 2) # round to two decimal places 
81.06 


And the standard deviation can be found via: 


round(nutri['height'].std(), 2)








We already encountered the describe method in the previous section for summarizing 
qualitative features, via the most frequent count and the number of unique elements. When 
applied to a quantitative feature, it returns instead the minimum, maximum, mean, and the 
three quartiles. For example, the 'height' feature in the nutri data has the following 
summary statistics. 


nutri['height'].describe() 


count    226.000000
mean     163.960177
std        9.003368
min      140.000000
25%      157.000000
50%      163.000000
75%      170.000000
max      188.000000
Name: height, dtype: float64
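To connect the var and std methods with the sample variance (1.1), the computation can be written out by hand. The sketch below uses five made-up heights. It relies on the fact that pandas divides by n − 1 by default, whereas numpy's var divides by n unless ddof=1 is given.

```python
import numpy as np
import pandas as pd

x = pd.Series([151.0, 162.0, 162.0, 154.0, 169.0])  # made-up heights in cm
n = len(x)

xbar = x.sum() / n                        # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)    # sample variance with divisor n - 1
s = np.sqrt(s2)                           # sample standard deviation

# Agreement with the pandas methods (default divisor n - 1).
print(np.isclose(s2, x.var()), np.isclose(s, x.std()))

# numpy's default divisor is n, so this value differs unless ddof=1 is passed.
print(np.isclose(s2, np.var(x.to_numpy())), np.isclose(s2, np.var(x.to_numpy(), ddof=1)))
```

For n = 226 the two divisors give nearly the same number, but the difference matters for small samples.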





1.5 Visualizing Data 


In this section we describe various methods for visualizing data. The main point we would 
like to make is that the way in which variables are visualized should always be adapted to 
the variable types; for example, qualitative data should be plotted differently from quantit- 
ative data. 
For the rest of this section, it is assumed that matplotlib.pyplot, pandas, and
numpy have been imported in the Python code as follows.


import matplotlib.pyplot as plt 
import pandas as pd 
import numpy as np 





1.5.1 Plotting Qualitative Variables 


Suppose we wish to display graphically how many elderly people are living by themselves,
as a couple, with family, or other. Recall that the data are given in the situation column
of our nutri data. Assuming that we already restructured the data, as in Section 1.2, we
can make a barplot of the number of people in each category via the plt.bar function of
the standard matplotlib plotting library. The inputs are the x-axis positions, heights, and
widths of each bar, respectively.


width = 0.35 # the width of the bars 

x = [0, 0.8, 1.6] # the bar positions on x-axis 
situation_counts=nutri['situation'].value_counts() 
plt.bar(x, situation_counts, width, edgecolor = 'black') 
plt.xticks(x, situation_counts.index) 

plt.show() 





Figure 1.1: Barplot for the qualitative variable 'situation'.


1.5.2 Plotting Quantitative Variables 


We now present a few useful methods for visualizing quantitative data, again using the 
nutri data set. We will first focus on continuous features (e.g., 'age') and then add some 
specific graphs related to discrete features (e.g., 'tea'). The aim is to describe the variab- 
ility present in a single feature. This typically involves a central tendency, where observa- 
tions tend to gather around, with fewer observations further away. The main aspects of the 
distribution are the location (or center) of the variability, the spread of the variability (how 
far the values extend from the center), and the shape of the variability; e.g., whether or not 
values are spread symmetrically on either side of the center. 


1.5.2.1 Boxplot 


A boxplot can be viewed as a graphical representation of the five-number summary of 
the data consisting of the minimum, maximum, and the first, second, and third quartiles. 
Figure 1.2 gives a boxplot for the 'age' feature of the nutri data. 


plt.boxplot(nutri['age'],widths=width,vert=False) 


plt.xlabel('age') 
plt.show() 





The widths parameter determines the width of the boxplot, which is by default plotted 
vertically. Setting vert=False plots the boxplot horizontally, as in Figure 1.2. 


Figure 1.2: Boxplot for 'age'.


The box is drawn from the first quartile (Q1) to the third quartile (Q3). The vertical line
inside the box signifies the location of the median. So-called “whiskers” extend to either
side of the box. The size of the box is called the interquartile range: IQR = Q3 − Q1. The
left whisker extends to the largest of (a) the minimum of the data and (b) Q1 − 1.5 IQR.
Similarly, the right whisker extends to the smallest of (a) the maximum of the data and
(b) Q3 + 1.5 IQR. Any data point outside the whiskers is indicated by a small hollow dot,
indicating a suspicious or deviant point (outlier). Note that a boxplot may also be used for
discrete quantitative features.
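The quantities that make up a boxplot can be computed by hand. The following sketch follows the whisker description in the text on made-up ages; the variable names are ours, and the quartiles use numpy's default interpolation rule.

```python
import numpy as np

age = np.array([65, 68, 70, 72, 73, 74, 75, 76, 78, 80, 91])  # made-up ages

q1, q2, q3 = np.quantile(age, [0.25, 0.5, 0.75])
iqr = q3 - q1                                # interquartile range

# Whisker limits as described in the text.
left = max(age.min(), q1 - 1.5 * iqr)
right = min(age.max(), q3 + 1.5 * iqr)

# Points beyond the whisker limits are flagged as outliers.
outliers = age[(age < left) | (age > right)]

print(q1, q2, q3, iqr)
print(left, right, outliers)
```

As far as we are aware, plt.boxplot draws each whisker to the most extreme data point that lies within these limits, so the plotted whisker end may sit slightly inside the limit computed here.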


1.5.2.2 Histogram 


A histogram is a common graphical representation of the distribution of a quantitative 
feature. We start by breaking the range of the values into a number of bins or classes. 
We tally the counts of the values falling in each bin and then make the plot by drawing 
rectangles whose bases are the bin intervals and whose heights are the counts. In Python 
we can use the function plt.hist. For example, Figure 1.3 shows a histogram of the 226 
ages in nutri, constructed via the following Python code. 


weights = np.ones_like(nutri.age)/nutri.age.count()
plt.hist(nutri.age,bins=9,weights=weights,facecolor='cyan',
         edgecolor='black', linewidth=1)
plt.xlabel('age')
plt.ylabel('Proportion of Total')
plt.show()
Here 9 bins were used. Rather than using raw counts (the default), the vertical axis
here gives the proportion of observations in each class, defined by count/total. This is
achieved by choosing the “weights” parameter to be equal to the vector with entries 1/226,
with length 226. Various plotting parameters have also been changed.
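The effect of the weights vector can be verified without plotting. The sketch below uses np.histogram, which accepts the same bins and weights arguments, on synthetic ages (a hypothetical stand-in for nutri.age, generated here only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for nutri.age: 226 synthetic ages between 65 and 91.
age = rng.integers(65, 92, size=226).astype(float)

weights = np.ones_like(age) / age.size                  # each entry equals 1/226
counts, edges = np.histogram(age, bins=9)               # raw counts per bin
props, _ = np.histogram(age, bins=9, weights=weights)   # proportions per bin

print(counts.sum())   # total number of observations
print(props.sum())    # the proportions add up to 1, up to rounding
```

Each proportion is simply the corresponding count divided by the sample size, so the bar heights of the weighted histogram sum to 1.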


Figure 1.3: Histogram of 'age'.


Histograms can also be used for discrete features, although it may be necessary to 
explicitly specify the bins and placement of the ticks on the axes. 


1.5.2.3 Empirical Cumulative Distribution Function 


The empirical cumulative distribution function, denoted by $F_n$, is a step function which
jumps an amount $k/n$ at observation values, where $k$ is the number of tied observations
at that value. For observations $x_1, \ldots, x_n$, $F_n(x)$ is the fraction of observations less than or
equal to $x$, i.e.,

$$F_n(x) = \frac{\text{number of } x_i \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i \leq x\}, \qquad (1.2)$$

where $\mathbb{1}$ denotes the indicator function; that is, $\mathbb{1}\{x_i \leq x\}$ is equal to 1 when $x_i \leq x$ and 0
otherwise. To produce a plot of the empirical cumulative distribution function we can use
the plt.step function. The result for the age data is shown in Figure 1.4. The empirical
cumulative distribution function for a discrete quantitative variable is obtained in the same
way.


x = np.sort(nutri.age)
y = np.linspace(0,1,len(nutri.age))
plt.xlabel('age')
plt.ylabel('Fn(x)')
plt.step(x,y)
plt.xlim(x.min(),x.max())
plt.show()
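Formula (1.2) can also be evaluated directly at any point. The following sketch defines a helper named ecdf (a hypothetical name, not part of any library) that counts the observations less than or equal to x via np.searchsorted; the small sample is made up and contains a tie so that the jump of size k/n is visible.

```python
import numpy as np

def ecdf(data, x):
    """Evaluate F_n(x) = (number of data points <= x) / n, as in (1.2)."""
    data = np.sort(np.asarray(data, dtype=float))
    # side='right' counts all observations less than or equal to x.
    return np.searchsorted(data, x, side='right') / data.size

# Hypothetical sample with a tie at 70 (not the nutri data).
sample = [68, 70, 70, 75, 81]
print(ecdf(sample, 69))            # 1 of the 5 observations is <= 69
print(ecdf(sample, 70))            # jump of k/n = 2/5 at the tied value
print(ecdf(sample, [60, 75, 90]))  # vectorized evaluation
```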
Figure 1.4: Plot of the empirical distribution function for the continuous quantitative
feature 'age'.


1.5.3 Data Visualization in a Bivariate Setting 


In this section, we present a few useful visual aids to explore relationships between two 
features. The graphical representation will depend on the type of the two features. 


1.5.3.1 Two-way Plots for Two Categorical Variables 


Comparing barplots for two categorical variables involves introducing subplots to the fig- 
ure. Figure 1.5 visualizes the contingency table of Section 1.3, which cross-tabulates the 
family status (situation) with the gender of the elderly people. It simply shows two barplots 
next to each other in the same figure. 


Figure 1.5: Barplot for two categorical variables.
The figure was made using the seaborn package, which was specifically designed to
simplify statistical visualization tasks. 


import seaborn as sns
sns.countplot(x='situation', hue = 'gender', data=nutri,
              hue_order = ['Male', 'Female'], palette = ['SkyBlue','Pink'],
              saturation = 1, edgecolor='black')
plt.legend(loc='upper center')
plt.xlabel('')
plt.ylabel('Counts')
plt.show()


1.5.3.2 Plots for Two Quantitative Variables 


We can visualize patterns between two quantitative features using a scatterplot. This can be
done with plt.scatter. The following code produces a scatterplot of 'weight' against
'height' for the nutri data.


plt.scatter(nutri.height, nutri.weight, s=12, marker='o')
plt.xlabel('height')
plt.ylabel('weight')
plt.show()





Figure 1.6: Scatterplot of 'weight' against 'height'.


The next Python code illustrates that it is possible to produce highly sophisticated scat- 
ter plots, such as in Figure 1.7. The figure shows the birth weights (mass) of babies whose 
mothers smoked (blue triangles) or not (red circles). In addition, straight lines were fitted to 
the two groups, suggesting that birth weight decreases with age when the mother smokes, 
but increases when the mother does not smoke! The question is whether these trends are 
statistically significant or due to chance. We will revisit this data set later on in the book. 





urlprefix = 'https://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'MASS/birthwt.csv'
bwt = pd.read_csv(urlprefix + dataname)
bwt = bwt.drop('Unnamed: 0', axis=1) #drop unnamed column
styles = {0: ['o','red'], 1: ['^','blue']}
for k in styles:
    grp = bwt[bwt.smoke==k]
    m,b = np.polyfit(grp.age, grp.bwt, 1) # fit a straight line
    plt.scatter(grp.age, grp.bwt, c=styles[k][1], s=15, linewidth=0,
                marker = styles[k][0])
    plt.plot(grp.age, m*grp.age + b, '-', color=styles[k][1])

plt.xlabel('age')
plt.ylabel('birth weight (g)')
plt.legend(['non-smokers', 'smokers'],prop={'size':8},
           loc=(0.5,0.8))
plt.show()
Figure 1.7: Birth weight against age for smoking and non-smoking mothers.


1.5.3.3 Plots for One Qualitative and One Quantitative Variable 


In this setting, it is interesting to draw boxplots of the quantitative feature for each level 
of the categorical feature. Assuming the variables are structured correctly, the function 
plt.boxplot can be used to produce Figure 1.8, using the following code: 


males = nutri[nutri.gender == 'Male']
females = nutri[nutri.gender == 'Female']
plt.boxplot([males.coffee, females.coffee], notch=True,
            widths=(0.5,0.5))
plt.xlabel('gender')
plt.ylabel('coffee')
plt.xticks([1,2],['Male','Female'])
plt.show()
Figure 1.8: Boxplots of a quantitative feature 'coffee' as a function of the levels of a
categorical feature 'gender'. Note that we used a different, “notched”, style boxplot this
time.


Further Reading 


The focus in this book is on the mathematical and statistical analysis of data, and for the 
rest of the book we assume that the data is available in a suitable form for analysis. How- 
ever, a large part of practical data science involves the cleaning of data; that is, putting 
it into a form that is amenable to analysis with standard software packages. Standard Py- 
thon modules such as numpy and pandas can be used to reformat rows, rename columns, 
remove faulty outliers, merge rows, and so on. McKinney, the creator of pandas, gives 
many practical case studies in [84]. Effective data visualization techniques are beautifully 
illustrated in [65]. 


Exercises 


Before you attempt these exercises, make sure you have up-to-date versions of the relevant 
Python packages, specifically matplotlib, pandas, and seaborn. An easy way to ensure 
this is to update packages via the Anaconda Navigator, as explained in Appendix D. 


1. Visit the UCI Repository https://archive.ics.uci.edu/. Read the description of
the data and download the Mushroom data set agaricus-lepiota.data. Using pandas, 
read the data into a DataFrame called mushroom, via read_csv. 


(a) How many features are in this data set? 


(b) What are the initial names and types of the features? 


(c) Rename the first feature (index 0) to 'edibility' and the sixth feature (index 5) to
'odor'. [Hint: the column names in pandas are immutable, so individual columns
cannot be modified directly. However, it is possible to assign the entire list of column
names via mushroom.columns = newcols.]


(d) The 6th column lists the various odors of the mushrooms, encoded as 'a', 'c', and so on.
Replace these with the names 'almond', 'creosote', etc. (the categories corresponding
to each letter can be found on the website). Also replace the 'edibility' categories
'e' and 'p' with 'edible' and 'poisonous'.

(e) Make a contingency table cross-tabulating 'edibility' and 'odor'. 


(f) Which mushroom odors should be avoided, when gathering mushrooms for consump- 
tion? 


(g) What proportion of odorless mushroom samples were safe to eat?


2. Change the type and value of variables in the nutri data set according to Table 1.2 and
save the data as a CSV file. The modified data should have eight categorical features, three
floats, and two integer features.


3. It frequently happens that a table with data needs to be restructured before the data can 
be analyzed using standard statistical software. As an example, consider the test scores in 
Table 1.3 of 5 students before and after specialized tuition. 


Table 1.3: Student scores. 





Student Before After 
1 75 85 
2 30 50 
3 100 100 
4 50 52 
5 60 65 


This is not in the standard format described in Section 1.1. In particular, the student scores 
are divided over two columns, whereas the standard format requires that they are collected 
in one column, e.g., labelled 'Score'. Reformat (by hand) the table in standard format, 
using three features: 


• 'Score', taking continuous values,

• 'Time', taking values 'Before' and 'After',

• 'Student', taking values from 1 to 5.

Useful methods for reshaping tables in pandas are melt, stack, and unstack.


4. Create a barplot similar to the one in Figure 1.5, but now plot the corresponding proportions
of males and females in each of the three situation categories. That is, the heights of the bars
should sum up to 1 for both barplots with the same 'gender' value. [Hint: seaborn does
not have this functionality built in; instead, you need to first create a contingency table and
use matplotlib.pyplot to produce the figure.]


5. The iris data set, mentioned in Section 1.1, contains various features, including
'Petal.Length' and 'Sepal.Length', of three species of iris: setosa, versicolor, and
virginica.

(a) Load the data set into a pandas DataFrame object.

(b) Using matplotlib.pyplot, produce boxplots of 'Petal.Length' for each of the
three species, in one figure.

(c) Make a histogram with 20 bins for 'Petal.Length'.

(d) Produce a similar scatterplot for 'Sepal.Length' against 'Petal.Length' to that
of the left plot in Figure 1.9. Note that the points should be colored according to the
'Species' feature as per the legend in the right plot of the figure.

(e) Using the kdeplot method of the seaborn package, reproduce the right plot of 
Figure 1.9, where kernel density plots for 'Petal.Length' are given. 


Figure 1.9: Left: scatterplot of 'Sepal.Length' against 'Petal.Length'. Right: kernel
density estimates of 'Petal.Length' for the three species of iris (setosa, versicolor, and
virginica).


6. Import the data set EuStockMarkets from the same website as the iris data set above. 
The data set contains the daily closing prices of four European stock indices during the 
1990s, for 260 working days per year. 


(a) Create a vector of times (working days) for the stock prices, between 1991.496 and 
1998.646 with increments of 1/260. 


(b) Reproduce Figure 1.10. [Hint: Use a dictionary to map column names (stock indices)
to colors.]

Figure 1.10: Closing stock indices for various European stock markets.


7. Consider the KASANDR data set from the UCI Machine Learning Repository, which can
be downloaded from


https://archive.ics.uci.edu/ml/machine-learning-databases/00385/de.tar.bz2.


This archive file has a size of 900 MB, so it may take a while to download. Uncompressing
the file (e.g., via 7-Zip) yields a directory de containing two large CSV files: test_de.csv
and train_de.csv, with sizes 372 MB and 3 GB, respectively. Such large data files can still
be processed efficiently in pandas, provided there is enough memory. The files contain 
records of user information from Kelkoo web logs in Germany as well as meta-data on 
users, offers, and merchants. The data sets have 7 attributes and 1919561 and 15844717 
rows, respectively. The data sets are anonymized via hex strings. 


(a) Load train_de.csv into a pandas DataFrame object de, using 
read_csv('train_de.csv', delimiter = '\t'). 


If not enough memory is available, load test_de.csv instead. Note that entries are 
separated here by tabs, not commas. Time how long it takes for the file to load, using 
the time package. (It took 38 seconds for train_de.csv to load on one of our 
computers.) 


(b) How many unique users and merchants are in this data set? 


8. Visualizing data involving more than two features requires careful design, which is often 
more of an art than a science. 


(a) Go to Vincent Arel-Bundock's website (URL given in Section 1.1) and read the
Orange data set into a pandas DataFrame object called orange. Remove its first 
(unnamed) column. 


(b) The data set contains the circumferences of 5 orange trees at various stages in their 
development. Find the names of the features. 


(c) In Python, import seaborn and visualize the growth curves (circumference against 
age) of the trees, using the regplot and FacetGrid methods. 


CHAPTER 2 





STATISTICAL LEARNING 





The purpose of this chapter is to introduce the reader to some common concepts 
and themes in statistical learning. We discuss the difference between supervised and 
unsupervised learning, and how we can assess the predictive performance of supervised 
learning. We also examine the central role that the linear and Gaussian properties play 
in the modeling of data. We conclude with a section on Bayesian learning. The required 
probability and statistics background is given in Appendix C. 


2.1 Introduction 


Although structuring and visualizing data are important aspects of data science, the main 
challenge lies in the mathematical analysis of the data. When the goal is to interpret the 
model and quantify the uncertainty in the data, this analysis is usually referred to as stat- 
istical learning. In contrast, when the emphasis is on making predictions using large-scale 
data, then it is common to speak about machine learning or data mining. 

There are two major goals for modeling data: 1) to accurately predict some future 
quantity of interest, given some observed data, and 2) to discover unusual or interesting 
patterns in the data. To achieve these goals, one must rely on knowledge from three im- 
portant pillars of the mathematical sciences. 


Function approximation. Building a mathematical model for data usually means under- 
standing how one data variable depends on another data variable. The most natural 
way to represent the relationship between variables is via a mathematical function or 
map. We usually assume that this mathematical function is not completely known, 
but can be approximated well given enough computing power and data. Thus, data 
scientists have to understand how best to approximate and represent functions using 
the least amount of computer processing and memory. 


Optimization. Given a class of mathematical models, we wish to find the best possible 
model in that class. This requires some kind of efficient search or optimization pro- 
cedure. The optimization step can be viewed as a process of fitting or calibrating 
a function to observed data. This step usually requires knowledge of optimization 
algorithms and efficient computer coding or programming.


Probability and Statistics. In general, the data used to fit the model is viewed as a
realization of a random process or numerical vector, whose probability law determines the
accuracy with which we can predict future observations. Thus, in order to quantify
the uncertainty inherent in making predictions about the future, and the sources of
error in the model, data scientists need a firm grasp of probability theory and statistical
inference.


2.2 Supervised and Unsupervised Learning 


Given an input or feature vector x, one of the main goals of machine learning is to predict 
an output or response variable y. For example, x could be a digitized signature and y a 
binary variable that indicates whether the signature is genuine or false. Another example is 
where x represents the weight and smoking habits of an expecting mother and y the birth 
weight of the baby. The data science attempt at this prediction is encoded in a mathematical 
function g, called the prediction function, which takes as an input x and outputs a guess g(x) 
for y (denoted by $\hat{y}$, for example). In a sense, g encompasses all the information about the
relationship between the variables x and y, excluding the effects of chance and randomness 
in nature. 

In regression problems, the response variable y can take any real value. In contrast, 
when y can only lie in a finite set, say $y \in \{0, \ldots, c-1\}$, then predicting y is conceptually
the same as classifying the input x into one of c categories, and so prediction becomes a 
classification problem. 

We can measure the accuracy of a prediction $\hat{y}$ with respect to a given response y by
using some loss function $\mathrm{Loss}(y, \hat{y})$. In a regression setting the usual choice is the
squared-error loss $(y - \hat{y})^2$. In the case of classification, the zero-one (also written 0-1) loss
function $\mathrm{Loss}(y, \hat{y}) = \mathbb{1}\{y \neq \hat{y}\}$ is often used, which incurs a loss of 1 whenever the predicted
class $\hat{y}$ is not equal to the class y. Later on in this book, we will encounter various other useful
loss functions, such as the cross-entropy and hinge loss functions (see, e.g., Chapter 7).
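As a small illustration, the two loss functions just described can be written as plain Python functions. The function names are chosen here for readability and are not taken from any library.

```python
def squared_error_loss(y, y_hat):
    """Squared-error loss (y - y_hat)^2, the usual choice for regression."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Zero-one loss: 1 if the predicted class differs from y, and 0 otherwise."""
    return int(y != y_hat)

print(squared_error_loss(3.0, 2.5))   # 0.25
print(zero_one_loss(1, 1))            # 0
print(zero_one_loss(1, 0))            # 1
```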


The word error is often used as a measure of distance between a “true” object y and
some approximation $\hat{y}$ thereof. If y is real-valued, the absolute error $|y - \hat{y}|$ and the
squared error $(y - \hat{y})^2$ are both well-established error concepts, as are the norm $\|y - \hat{y}\|$
and squared norm $\|y - \hat{y}\|^2$ for vectors. The squared error $(y - \hat{y})^2$ is just one example
of a loss function.







It is unlikely that any mathematical function g will be able to make accurate predictions 
for all possible pairs (x, y) one may encounter in Nature. One reason for this is that, even 
with the same input x, the output y may be different, depending on chance circumstances 
or randomness. For this reason, we adopt a probabilistic approach and assume that each 
pair (x,y) is the outcome of a random pair (X, Y) that has some joint probability density 
f(x,y). We then assess the predictive performance via the expected loss, usually called the 
risk, for g: 

$$\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(X)). \qquad (2.1)$$

For example, in the classification case with zero-one loss function the risk is equal to the
probability of incorrect classification: $\ell(g) = \mathbb{P}[Y \neq g(X)]$. In this context, the prediction
function g is called a classifier. Given the distribution of (X, Y) and any loss function, we
can in principle find the best possible $g^* := \operatorname{argmin}_g \mathbb{E}\,\mathrm{Loss}(Y, g(X))$ that yields the smallest
risk $\ell^* := \ell(g^*)$. We will see in Chapter 7 that in the classification case with $y \in \{0, \ldots, c-1\}$
and $\ell(g) = \mathbb{P}[Y \neq g(X)]$, we have

$$g^*(x) = \operatorname*{argmax}_{y \in \{0, \ldots, c-1\}} f(y \mid x),$$

where $f(y \mid x) = \mathbb{P}[Y = y \mid X = x]$ is the conditional probability of Y = y given X = x.
As already mentioned, for regression the most widely-used loss function is the squared-error
loss. In this setting, the optimal prediction function g* is often called the regression
function. The following theorem specifies its exact form.


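Theorem 2.1: Optimal Prediction Function for Squared-Error Loss

For the squared-error loss $\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2$, the optimal prediction function $g^*$ is equal
to the conditional expectation of Y given X = x:

$$g^*(x) = \mathbb{E}[Y \mid X = x].$$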





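Proof: Let $g^*(x) = \mathbb{E}[Y \mid X = x]$. For any function g, the squared-error risk satisfies

$$\begin{aligned}
\mathbb{E}(Y - g(X))^2 &= \mathbb{E}\big[(Y - g^*(X) + g^*(X) - g(X))^2\big] \\
&= \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}\big[(Y - g^*(X))(g^*(X) - g(X))\big] + \mathbb{E}(g^*(X) - g(X))^2 \\
&\geq \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}\big[(Y - g^*(X))(g^*(X) - g(X))\big] \\
&= \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}\big\{(g^*(X) - g(X))\,\mathbb{E}[Y - g^*(X) \mid X]\big\}.
\end{aligned}$$

In the last equation we used the tower property. By the definition of the conditional
expectation, we have $\mathbb{E}[Y - g^*(X) \mid X] = 0$. It follows that $\mathbb{E}(Y - g(X))^2 \geq \mathbb{E}(Y - g^*(X))^2$, showing
that g* yields the smallest squared-error risk. □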
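The inequality in the proof can be illustrated by simulation. The sketch below assumes a toy model chosen only for this illustration (X uniform on (0, 1) and Y equal to X squared plus standard normal noise), so that the conditional expectation is known, and compares Monte Carlo estimates of the squared-error risk of g* with that of one arbitrary competitor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed toy model: X ~ U(0,1) and Y = X^2 + N(0,1) noise, so E[Y | X = x] = x^2.
X = rng.uniform(0, 1, n)
Y = X**2 + rng.normal(0, 1, n)

g_star = lambda x: x**2     # the conditional expectation
g_other = lambda x: x       # some other prediction function

risk_star = np.mean((Y - g_star(X))**2)     # close to the noise variance, 1
risk_other = np.mean((Y - g_other(X))**2)   # close to 1 + E(X^2 - X)^2 = 1 + 1/30
print(risk_star, risk_other)
```

The gap between the two estimates approximates $\mathbb{E}(g^*(X) - g(X))^2$, which is exactly the term dropped in the inequality step of the proof.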


One consequence of Theorem 2.1 is that, conditional on X = x, the (random) response
Y can be written as

$$Y = g^*(x) + \varepsilon(x), \qquad (2.2)$$

where $\varepsilon(x)$ can be viewed as the random deviation of the response from its conditional
mean at x. This random deviation satisfies $\mathbb{E}\,\varepsilon(x) = 0$. Further, the conditional variance of
the response Y at x can be written as $\mathrm{Var}\,\varepsilon(x) = v^2(x)$ for some unknown positive function
v. Note that, in general, the probability distribution of $\varepsilon(x)$ is unspecified.

Since the optimal prediction function g* depends on the typically unknown joint distribution
of (X, Y), it is not available in practice. Instead, all that we have available is a finite
number of (usually) independent realizations from the joint density f(x, y). We denote this
sample by $\mathcal{T} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ and call it the training set ($\mathcal{T}$ is a mnemonic for
training) with n examples. It will be important to distinguish between a random training
set $\mathcal{T}$ and its (deterministic) outcome $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. We will use the notation $\tau$ for
the latter. We will also add the subscript n in $\tau_n$ when we wish to emphasize the size of the
training set.

Our goal is thus to “learn” the unknown g* using the n examples in the training set $\mathcal{T}$.
Let us denote by $g_{\mathcal{T}}$ the best (by some criterion) approximation for g* that we can construct
from $\mathcal{T}$. Note that $g_{\mathcal{T}}$ is a random function. A particular outcome is denoted by $g_\tau$. It is
often useful to think of a teacher-learner metaphor, whereby the function $g_{\mathcal{T}}$ is a learner
who learns the unknown functional relationship $g^* : x \mapsto y$ from the training data $\mathcal{T}$. We
can imagine a “teacher” who provides n examples of the true relationship between the
output $Y_i$ and the input $X_i$ for $i = 1, \ldots, n$, and thus “trains” the learner $g_{\mathcal{T}}$ to predict the
output of a new input X, for which the correct output Y is not provided by the teacher (is
unknown).

The above setting is called supervised learning, because one tries to learn the functional 
relationship between the feature vector x and response y in the presence of a teacher who 
provides n examples. It is common to speak of “explaining” or predicting y on the basis of 
x, where x is a vector of explanatory variables. 

An example of supervised learning is email spam detection. The goal is to train the 
learner $g_{\mathcal{T}}$ to accurately predict whether any future email, as represented by the feature
vector x, is spam or not. The training data consists of the feature vectors of a number 
of different email examples as well as the corresponding labels (spam or not spam). For 
instance, a feature vector could consist of the number of times sales-pitch words like “free”, 
“sale”, or “miss out” occur within a given email. 
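A feature vector of this kind could be computed as in the following sketch. The phrase list and the helper name feature_vector are hypothetical choices made for this illustration, and the simple substring counting used here would also count a phrase that occurs inside a longer word.

```python
# Hypothetical list of sales-pitch phrases used as features.
PHRASES = ["free", "sale", "miss out"]

def feature_vector(email_text):
    """Count how often each phrase occurs in the email (case-insensitive)."""
    text = email_text.lower()
    return [text.count(p) for p in PHRASES]

email = "FREE gift! Big sale today, don't miss out on this free offer."
print(feature_vector(email))   # [2, 1, 1]
```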

As seen from the above discussion, most questions of interest in supervised learning 
can be answered if we know the conditional pdf f(y |x), because we can then in principle 
work out the function value g*(x). 

In contrast, unsupervised learning makes no distinction between response and explanatory
variables, and the objective is simply to learn the structure of the unknown distribution
of the data. In other words, we need to learn f(x). In this case the guess g(x) is an
approximation of f(x) and the risk is of the form

$$\ell(g) = \mathbb{E}\,\mathrm{Loss}(f(X), g(X)).$$


An example of unsupervised learning is when we wish to analyze the purchasing be- 
haviors of the customers of a grocery shop that has a total of, say, a hundred items on sale. 
A feature vector here could be a binary vector $x \in \{0, 1\}^{100}$ representing the items bought
by a customer on a visit to the shop (a 1 in the k-th position if a customer bought item
$k \in \{1, \ldots, 100\}$ and a 0 otherwise). Based on a training set $\tau = \{x_1, \ldots, x_n\}$, we wish to
find any interesting or unusual purchasing patterns. In general, it is difficult to know if an 
unsupervised learner is doing a good job, because there is no teacher to provide examples 
of accurate predictions. 

The main methodologies for unsupervised learning include clustering, principal com- 
ponent analysis, and kernel density estimation, which will be discussed in Chapter 4. 

In the next three sections we will focus on supervised learning. The main super- 
vised learning methodologies are regression and classification, to be discussed in detail in 
Chapters 5 and 7. More advanced supervised learning techniques, including reproducing 
kernel Hilbert spaces, tree methods, and deep learning, will be discussed in Chapters 6, 8, 
and 9. 
2.3 Training and Test Loss


Given an arbitrary prediction function g, it is typically not possible to compute its risk $\ell(g)$
in (2.1). However, using the training sample $\mathcal{T}$, we can approximate $\ell(g)$ via the empirical
(sample average) risk

$$\ell_{\mathcal{T}}(g) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(Y_i, g(X_i)), \qquad (2.3)$$

which we call the training loss. The training loss is thus an unbiased estimator of the risk
(the expected loss) for a prediction function g, based on the training data.

To approximate the optimal prediction function g* (the minimizer of the risk $\ell(g)$) we
first select a suitable collection of approximating functions $\mathcal{G}$ and then take our learner to
be the function in $\mathcal{G}$ that minimizes the training loss; that is,

$$g_{\mathcal{T}}^{\mathcal{G}} = \operatorname*{argmin}_{g \in \mathcal{G}} \ell_{\mathcal{T}}(g). \qquad (2.4)$$

For example, the simplest and most useful $\mathcal{G}$ is the set of linear functions of x; that is, the
set of all functions $g : x \mapsto \beta^\top x$ for some real-valued vector $\beta$.
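For the class of linear functions with squared-error loss, minimizing the training loss is a least-squares problem. The following sketch uses simulated data with a known parameter vector (an assumption made purely for illustration) and np.linalg.lstsq to find the minimizing vector.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200

# Assumed toy data: feature vectors [1, x1, x2] and responses from a known beta plus noise.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.1, n)

# Minimize the squared-error training loss (1/n) * sum_i (y_i - beta^T x_i)^2 over beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
training_loss = np.mean((y - X @ beta_hat)**2)
print(beta_hat, training_loss)
```

Because the minimization runs over all linear functions, the resulting training loss can never exceed the training loss of the true parameter vector used to simulate the data.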

We suppress the superscript $\mathcal{G}$ when it is clear which function class is used. Note that
minimizing the training loss over all possible functions g (rather than over all $g \in \mathcal{G}$) does
not lead to a meaningful optimization problem, as any function g for which $g(X_i) = Y_i$ for
all i gives minimal training loss. In particular, for a squared-error loss, the training loss will
be 0. Unfortunately, such functions have a poor ability to predict new (that is, independent
from $\mathcal{T}$) pairs of data. This poor generalization performance is called overfitting.


By choosing g to be a function that predicts the training data exactly (and is, for example,
0 otherwise), the squared-error training loss is zero. Minimizing the training loss is
not the ultimate goal!
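This point can be made concrete with a function that simply memorizes the training pairs. The sketch below assumes a toy model (X uniform on (0, 1), Y equal to 2X plus normal noise with standard deviation 0.5), which is invented here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    # Assumed toy model: X ~ U(0,1), Y = 2X + N(0, 0.25) noise.
    x = rng.uniform(0, 1, n)
    return x, 2 * x + rng.normal(0, 0.5, n)

x_train, y_train = sample(50)
x_test, y_test = sample(1000)

memory = dict(zip(x_train, y_train))

def g_memorize(x):
    """Return the stored response for a training input, and 0 otherwise."""
    return np.array([memory.get(xi, 0.0) for xi in x])

def avg_sq_loss(g, x, y):
    """Average squared-error loss of g on the pairs (x_i, y_i)."""
    return np.mean((y - g(x))**2)

train_loss = avg_sq_loss(g_memorize, x_train, y_train)     # zero on the training data
test_loss = avg_sq_loss(g_memorize, x_test, y_test)        # large on new data
oracle_loss = avg_sq_loss(lambda x: 2 * x, x_test, y_test)  # near the noise variance 0.25
print(train_loss, test_loss, oracle_loss)
```

The memorizing function attains zero training loss, yet on independent data it does far worse than the conditional mean, which is the overfitting phenomenon described above.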





The prediction accuracy of new pairs of data is measured by the generalization risk of
the learner. For a fixed training set $\tau$ it is defined as

$$\ell(g_\tau^{\mathcal{G}}) = \mathbb{E}\,\mathrm{Loss}(Y, g_\tau^{\mathcal{G}}(X)), \qquad (2.5)$$

where (X, Y) is distributed according to f(x, y). In the discrete case the generalization risk
is therefore $\ell(g_\tau^{\mathcal{G}}) = \sum_{x,y} \mathrm{Loss}(y, g_\tau^{\mathcal{G}}(x))\, f(x, y)$ (replace the sum with an integral for the
continuous case). The situation is illustrated in Figure 2.1, where the distribution of (X, Y)
is indicated by the red dots. The training set (points in the shaded regions) determines a
fixed prediction function shown as a straight line. Three possible outcomes of (X, Y) are
shown (black dots). The amount of loss for each point is shown as the length of the dashed
lines. The generalization risk is the average loss over all possible pairs (x, y), weighted by
the corresponding f(x, y).
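In the discrete case the weighted sum can be evaluated directly. The following sketch uses a small, made-up joint probability mass function on {0, 1} x {0, 1} and a fixed classifier with zero-one loss; all numbers are hypothetical.

```python
# Hypothetical joint pmf f(x, y) on {0,1} x {0,1}; the values sum to 1.
f = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def g(x):
    """A fixed classifier: predict the class equal to the input."""
    return x

def zero_one_loss(y, y_hat):
    return int(y != y_hat)

# Generalization risk: sum over all (x, y) of Loss(y, g(x)) * f(x, y).
risk = sum(zero_one_loss(y, g(x)) * p for (x, y), p in f.items())
print(risk)   # equals P[Y != g(X)] = f(0,1) + f(1,0), up to rounding
```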
Figure 2.1: The generalization risk for a fixed training set is the weighted-average loss over
all possible pairs (x, y).

For a random training set $\mathcal{T}$, the generalization risk is thus a random variable that
depends on $\mathcal{T}$ (and $\mathcal{G}$). If we average the generalization risk over all possible instances of
$\mathcal{T}$, we obtain the expected generalization risk:

$$\mathbb{E}\,\ell(g_{\mathcal{T}}^{\mathcal{G}}) = \mathbb{E}\,\mathrm{Loss}(Y, g_{\mathcal{T}}^{\mathcal{G}}(X)), \qquad (2.6)$$

where (X, Y) in the expectation above is independent of $\mathcal{T}$. In the discrete case, we have

$$\mathbb{E}\,\ell(g_{\mathcal{T}}^{\mathcal{G}}) = \sum_{x, y, x_1, y_1, \ldots, x_n, y_n} \mathrm{Loss}(y, g_\tau^{\mathcal{G}}(x))\, f(x, y)\, f(x_1, y_1) \cdots f(x_n, y_n).$$

Figure 2.2 gives an illustration.











Figure 2.2: The expected generalization risk is the weighted-average loss over all possible 
pairs (x, y) and over all training sets. 


For any outcome $\tau$ of the training data, we can estimate the generalization risk without
bias by taking the sample average

$$\ell_{\mathcal{T}'}(g_\tau^{\mathcal{G}}) := \frac{1}{n'} \sum_{i=1}^{n'} \mathrm{Loss}(Y_i', g_\tau^{\mathcal{G}}(X_i')), \qquad (2.7)$$

where $\{(X_1', Y_1'), \ldots, (X_{n'}', Y_{n'}')\} =: \mathcal{T}'$ is a so-called test sample. The test sample is
completely separate from $\mathcal{T}$, but is drawn in the same way as $\mathcal{T}$; that is, via independent draws
from f(x, y), for some sample size n'. We call the estimator (2.7) the test loss. For a random
training set $\mathcal{T}$ we can define $\ell_{\mathcal{T}'}(g_{\mathcal{T}}^{\mathcal{G}})$ similarly. It is then crucial to assume that $\mathcal{T}$ is
independent of $\mathcal{T}'$. Table 2.1 summarizes the main definitions and notation for supervised
learning.
Table 2.1: Summary of definitions for supervised learning.


$x$                                  Fixed explanatory (feature) vector.
$X$                                  Random explanatory (feature) vector.
$y$                                  Fixed (real-valued) response.
$Y$                                  Random response.
$f(x, y)$                            Joint pdf of X and Y, evaluated at (x, y).
$f(y \mid x)$                        Conditional pdf of Y given X = x, evaluated at y.
$\tau$ or $\tau_n$                   Fixed training data $\{(x_i, y_i), i = 1, \ldots, n\}$.
$\mathcal{T}$ or $\mathcal{T}_n$     Random training data $\{(X_i, Y_i), i = 1, \ldots, n\}$.
$\mathbf{X}$                         Matrix of explanatory variables, with n rows $x_i^\top$, $i = 1, \ldots, n$,
                                     and dim(x) feature columns; one of the features may be the
                                     constant 1.
$\boldsymbol{y}$                     Vector of response variables $(y_1, \ldots, y_n)^\top$.
$g$                                  Prediction (guess) function.
$\mathrm{Loss}(y, \hat{y})$          Loss incurred when predicting response y with $\hat{y}$.
$\ell(g)$                            Risk for prediction function g; that is, $\mathbb{E}\,\mathrm{Loss}(Y, g(X))$.
$g^*$                                Optimal prediction function; that is, $\operatorname{argmin}_g \ell(g)$.
$g^{\mathcal{G}}$                    Optimal prediction function in function class $\mathcal{G}$; that is,
                                     $\operatorname{argmin}_{g \in \mathcal{G}} \ell(g)$.
$\ell_\tau(g)$                       Training loss for prediction function g; that is, the sample
                                     average estimate of $\ell(g)$ based on a fixed training sample $\tau$.
$\ell_{\mathcal{T}}(g)$              The same as $\ell_\tau(g)$, but now for a random training sample $\mathcal{T}$.
$g_\tau^{\mathcal{G}}$ or $g_\tau$   The learner: $\operatorname{argmin}_{g \in \mathcal{G}} \ell_\tau(g)$. That is, the optimal prediction
                                     function based on a fixed training set $\tau$ and function class $\mathcal{G}$.
                                     We suppress the superscript $\mathcal{G}$ if the function class is implicit.
$g_{\mathcal{T}}^{\mathcal{G}}$ or $g_{\mathcal{T}}$   The learner, where we have replaced $\tau$ with a random training
                                     set $\mathcal{T}$.
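The training loss and the test loss (2.7) can be compared in a small simulation. The sketch below assumes a toy linear model with standard normal noise (invented for this illustration), fits a straight line by least squares on a small training set, and evaluates the squared-error loss on a large independent test sample.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw(n):
    # Assumed toy model: X ~ U(0,1), Y = 1 + 2X + N(0,1) noise.
    x = rng.uniform(0, 1, n)
    return x, 1 + 2 * x + rng.normal(0, 1, n)

x_tr, y_tr = draw(20)        # outcome of the training set
x_te, y_te = draw(10_000)    # independent test sample

# Learner: the straight line minimizing the squared-error training loss.
slope, intercept = np.polyfit(x_tr, y_tr, 1)
g = lambda x: intercept + slope * x

train_loss = np.mean((y_tr - g(x_tr))**2)
test_loss = np.mean((y_te - g(x_te))**2)
print(train_loss, test_loss)
```

Since the test sample is independent of the training data, the test loss estimates the generalization risk of the fitted line, which cannot be smaller than the noise variance of the assumed model.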





To compare the predictive performance of various learners in the function class $\mathcal{G}$, as
measured by the test loss, we can use the same fixed training set $\tau$ and test set $\tau'$ for all
learners. When there is an abundance of data, the “overall” data set is usually (randomly)
divided into a training and test set, as depicted in Figure 2.3. We then use the training data
to construct various learners $g_\tau^{\mathcal{G}_1}, g_\tau^{\mathcal{G}_2}, \ldots$, and use the test data to select the best (with the
smallest test loss) among these learners. In this context the test set is called the validation
set. Once the best learner has been chosen, a third “test” set can be used to assess the
predictive performance of the best learner. The training, validation, and test sets can again
be obtained from the overall data set via a random allocation. When the overall data set
is of modest size, it is customary to perform the validation phase (model selection) on the
training set only, using cross-validation. This is the topic of Section 2.5.2.
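A random allocation into training, validation, and test sets can be sketched as follows. The data set size and the 60/20/20 proportions are assumptions made for illustration; the text does not prescribe particular proportions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000                      # size of the overall data set (hypothetical)
idx = rng.permutation(n)      # random ordering of the row indices

# Assumed 60/20/20 split; the proportions are a design choice.
n_train = n * 6 // 10
n_val = n * 2 // 10
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

The three index arrays are disjoint and together cover every row, so each observation is used in exactly one of the three roles.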
Figure 2.3: Statistical learning algorithms often require the data to be divided into training
and test data. If the latter is used for model selection, a third set is needed for testing the 
performance of the selected model. 


We next consider a concrete example that illustrates the concepts introduced so far. 


E Example 2.1 (Polynomial Regression) In what follows, it will appear that we have ar- 
bitrarily replaced the symbols x, g, G with u, h, H, respectively. The reason for this switch 
of notation will become clear at the end of the example. 

The data (depicted as dots) in Figure 2.4 are $n = 100$ points $(u_i, y_i), i = 1, \ldots, n$, drawn
from iid random points $(U_i, Y_i), i = 1, \ldots, n$, where the $\{U_i\}$ are uniformly distributed on
the interval (0, 1) and, given $U_i = u_i$, the random variable $Y_i$ has a normal distribution with
expectation $10 - 140u_i + 400u_i^2 - 250u_i^3$ and variance $\ell^* = 25$. This is an example of a
polynomial regression model. Using a squared-error loss, the optimal prediction function
$h^*(u) = \mathbb{E}[Y \mid U = u]$ is thus

$$h^*(u) = 10 - 140u + 400u^2 - 250u^3,$$

which is depicted by the dashed curve in Figure 2.4.


Figure 2.4: Training data and the optimal polynomial prediction function $h^*$. (Legend: data points; true.)
To obtain a good estimate of $h^*(u)$ based on the training set $\tau = \{(u_i, y_i), i = 1, \ldots, n\}$,
we minimize the outcome of the training loss (2.3):

$$\ell_\tau(h) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h(u_i))^2, \qquad (2.8)$$

over a suitable set $\mathcal{H}$ of candidate functions. Let us take the set $\mathcal{H}_p$ of polynomial functions
in $u$ of order $p - 1$:

$$h(u) := \beta_1 + \beta_2 u + \beta_3 u^2 + \cdots + \beta_p u^{p-1} \qquad (2.9)$$


for $p = 1, 2, \ldots$ and parameter vector $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_p]^\top$. This function class contains the
best possible $h^*(u) = \mathbb{E}[Y \mid U = u]$ for $p \geq 4$. Note that optimization over $\mathcal{H}_p$ is a parametric
optimization problem, in that we need to find the best $\boldsymbol{\beta}$. Optimization of (2.8) over $\mathcal{H}_p$ is
not straightforward, unless we notice that (2.9) is a linear function in $\boldsymbol{\beta}$. In particular, if we
map each feature $u$ to a feature vector $\mathbf{x} = [1, u, u^2, \ldots, u^{p-1}]^\top$, then the right-hand side of
(2.9) can be written as the function

$$g(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta},$$

which is linear in $\mathbf{x}$ (as well as $\boldsymbol{\beta}$). The optimal $h^*(u)$ in $\mathcal{H}_p$ for $p \geq 4$ then corresponds
to the function $g^*(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}^*$ in the set $\mathcal{G}_p$ of linear functions from $\mathbb{R}^p$ to $\mathbb{R}$, where
$\boldsymbol{\beta}^* = [10, -140, 400, -250, 0, \ldots, 0]^\top$. Thus, instead of working with the set $\mathcal{H}_p$ of polynomial
functions we may prefer to work with the set $\mathcal{G}_p$ of linear functions. This brings us to a
very important idea in statistical learning:


Expand the feature space to obtain a linear prediction function.


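Let us now reformulate the learning problem in terms of the new explanatory (feature)
variables $\mathbf{x}_i = [1, u_i, u_i^2, \ldots, u_i^{p-1}]^\top$, $i = 1, \ldots, n$. It will be convenient to arrange these
feature vectors into a matrix $\mathbf{X}$ with rows $\mathbf{x}_1^\top, \ldots, \mathbf{x}_n^\top$:

$$\mathbf{X} = \begin{bmatrix} 1 & u_1 & u_1^2 & \cdots & u_1^{p-1} \\ 1 & u_2 & u_2^2 & \cdots & u_2^{p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & u_n & u_n^2 & \cdots & u_n^{p-1} \end{bmatrix}. \qquad (2.10)$$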


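Collecting the responses $\{y_i\}$ into a column vector $\mathbf{y}$, the training loss (2.3) can now be
written compactly as

$$\frac{1}{n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2. \qquad (2.11)$$

To find the optimal learner (2.4) in the class $\mathcal{G}_p$ we need to find the minimizer of (2.11):

$$\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2, \qquad (2.12)$$

which is called the ordinary least-squares solution. As is illustrated in Figure 2.5, to find $\hat{\boldsymbol{\beta}}$,
we choose $\mathbf{X}\hat{\boldsymbol{\beta}}$ to be equal to the orthogonal projection of $\mathbf{y}$ onto the linear space spanned
by the columns of the matrix $\mathbf{X}$; that is, $\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{P}\mathbf{y}$, where $\mathbf{P}$ is the projection matrix.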
Figure 2.5: $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the linear space spanned by the
columns of the matrix $\mathbf{X}$.
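According to Theorem A.4, the projection matrix is given by

$$\mathbf{P} = \mathbf{X}\mathbf{X}^+, \qquad (2.13)$$

where the $p \times n$ matrix $\mathbf{X}^+$ in (2.13) is the pseudo-inverse of $\mathbf{X}$. If $\mathbf{X}$ happens to be of full
column rank (so that none of the columns can be expressed as a linear combination of the
other columns), then $\mathbf{X}^+ = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$.

In any case, from $\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{P}\mathbf{y}$ and $\mathbf{P}\mathbf{X} = \mathbf{X}$, we can see that $\hat{\boldsymbol{\beta}}$ satisfies the normal
equations:

$$\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{P} \mathbf{y} = (\mathbf{P}\mathbf{X})^\top \mathbf{y} = \mathbf{X}^\top \mathbf{y}. \qquad (2.14)$$

This is a set of linear equations, which can be solved very fast and whose solution can be
written explicitly as:

$$\hat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{y}. \qquad (2.15)$$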
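As a numerical sanity check of (2.13) to (2.15), the following sketch computes the least-squares solution both from the pseudo-inverse and from the normal equations, and forms the projection matrix. The simulated data, the seed, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
u = rng.random((n, 1))
X = u ** np.arange(p)                      # rows [1, u_i, u_i^2, u_i^3]
beta_true = np.array([10.0, -140.0, 400.0, -250.0])
y = X @ beta_true + 5 * rng.standard_normal(n)

X_plus = np.linalg.pinv(X)                 # pseudo-inverse, a p x n matrix
beta_pinv = X_plus @ y                     # explicit solution (2.15)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations (2.14)
P = X @ X_plus                             # projection matrix (2.13)
```

For this well-conditioned case the two solutions agree to numerical precision, and `P` behaves as a projection: it is symmetric, idempotent, and leaves the columns of `X` unchanged.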


Figure 2.6 shows the trained learners for various values of $p$:

$$h_\tau^{\mathcal{H}_p}(u) = g_\tau^{\mathcal{G}_p}(\mathbf{x}) = \mathbf{x}^\top \hat{\boldsymbol{\beta}}.$$

Figure 2.6: Training data with fitted curves for $p = 2, 4$, and 16. The true cubic polynomial
curve for $p = 4$ is also plotted (dashed line). (Legend: data points; true; $p = 2$, underfit; $p = 4$, correct; $p = 16$, overfit. Vertical axis: $h_\tau^{\mathcal{H}_p}(u)$.)
We see that for $p = 16$ the fitted curve lies closer to the data points, but is further away
from the dashed true polynomial curve, indicating that we overfit. The choice $p = 4$ (the
true cubic polynomial) is much better than $p = 16$, or indeed $p = 2$ (straight line).

Each function class $\mathcal{G}_p$ gives a different learner $g_\tau^{\mathcal{G}_p}$, $p = 1, 2, \ldots$. To assess which is
better, we should not simply take the one that gives the smallest training loss. We can
always get a zero training loss by taking $p = n$, because for any set of $n$ points there exists
a polynomial of degree $n - 1$ that interpolates all points!
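The interpolation remark can be illustrated on a small data set. The sketch below is an assumption-laden toy version: it uses only $n = 8$ equally spaced points (rather than the 100 random points of the example) so that the $p = n$ fit remains numerically well behaved, and the helper name `training_loss` is hypothetical.

```python
import numpy as np

n = 8
u = np.linspace(0.05, 0.95, n)             # small, equally spaced design
rng = np.random.default_rng(0)
y = 10 - 140 * u + 400 * u**2 - 250 * u**3 + 5 * rng.standard_normal(n)

def training_loss(p):
    """Training loss (2.11) of the least-squares fit with p parameters."""
    X = np.vander(u, p, increasing=True)   # columns 1, u, ..., u^(p-1)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.mean((y - X @ beta_hat) ** 2)

losses = [training_loss(p) for p in range(1, n + 1)]
```

The training loss never increases as $p$ grows, and for $p = n$ it is zero up to rounding error, even though the fitted polynomial then follows the noise rather than $h^*$.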

Instead, we assess the predictive performance of the learners using the test loss (2.7),
computed from a test data set. If we collect all $n'$ test feature vectors in a matrix $\mathbf{X}'$ and
the corresponding test responses in a vector $\mathbf{y}'$, then, similar to (2.11), the test loss can be
written compactly as

$$\ell_{\tau'}(g_\tau^{\mathcal{G}_p}) = \frac{1}{n'} \|\mathbf{y}' - \mathbf{X}' \hat{\boldsymbol{\beta}}\|^2,$$

where $\hat{\boldsymbol{\beta}}$ is given by (2.15), using the training data.

Figure 2.7 shows a plot of the test loss against the number of parameters in the vector
$\boldsymbol{\beta}$; that is, $p$. The graph has a characteristic "bath-tub" shape and is at its lowest for $p = 4$,
correctly identifying the polynomial order 3 for the true model. Note that the test loss (2.7), as
an estimate for the generalization risk (2.5), becomes numerically unreliable after $p = 16$
(the graph goes down, where it should go up). The reader may check that the graph for
the training loss exhibits a similar numerical instability for large $p$, and in fact fails to
numerically decrease to 0 for large $p$, contrary to what it should do in theory. The numerical
problems arise from the fact that for large $p$ the columns of the (Vandermonde) matrix $\mathbf{X}$
are of vastly different magnitudes and so floating point errors quickly become very large.
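One way to see the numerical difficulty is to look at the condition number of the Vandermonde matrix as $p$ grows. The sketch below uses simulated features and an arbitrary seed, both of which are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random((100, 1))                   # features on (0, 1), as in the example

# 2-norm condition number of the n x p Vandermonde matrix for several p
conds = {p: np.linalg.cond(u ** np.arange(p)) for p in (2, 4, 8, 16)}

for p, c in conds.items():
    print(f"p = {p:2d}: cond(X) = {c:.3e}")
```

The condition number grows rapidly with $p$. Moreover, solving the normal equations works with $\mathbf{X}^\top \mathbf{X}$, whose 2-norm condition number is the square of that of $\mathbf{X}$, which makes the loss of accuracy for large $p$ even more pronounced.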

Finally, observe that the lower bound for the test loss is here around 21, which corresponds
to an estimate of the minimal (squared-error) risk $\ell^* = 25$.
Figure 2.7: Test loss as function of the number of parameters $p$ of the model.


This script shows how the training data were generated and plotted in Python:

polyreg1.py

import numpy as np
from numpy.random import rand, randn
from numpy.linalg import norm, solve
import matplotlib.pyplot as plt

def generate_data(beta, sig, n):
    u = np.random.rand(n, 1)
    y = (u ** np.arange(0, 4)) @ beta + sig * np.random.randn(n, 1)
    return u, y

np.random.seed(12)
beta = np.array([[10, -140, 400, -250]]).T
n = 100
sig = 5
u, y = generate_data(beta, sig, n)

xx = np.arange(np.min(u), np.max(u)+5e-3, 5e-3)
yy = np.polyval(np.flip(beta), xx)
plt.plot(u, y, '.', markersize=8)
plt.plot(xx, yy, '--', linewidth=3)
plt.xlabel(r'$u$')
plt.ylabel(r'$h^*(u)$')
plt.legend(['data points', 'true'])
plt.show()





The following code, which imports the code above, fits polynomial models with $p =
1, \ldots, K = 18$ parameters to the training data and plots a selection of fitted curves, as
shown in Figure 2.6.

polyreg2.py

from polyreg1 import *

max_p = 18
p_range = np.arange(1, max_p + 1, 1)
X = np.ones((n, 1))
betahat, trainloss = {}, {}

for p in p_range:  # p is the number of parameters
    if p > 1:
        X = np.hstack((X, u**(p-1)))  # add column to matrix

    betahat[p] = solve(X.T @ X, X.T @ y)
    trainloss[p] = (norm(y - X @ betahat[p])**2/n)

p = [2, 4, 16]  # select three curves

# replot the points and true line and store in the list "plots"
plots = [plt.plot(u, y, 'k.', markersize=8)[0],
         plt.plot(xx, yy, 'k--', linewidth=3)[0]]
# add the three curves
for i in p:
    yy = np.polyval(np.flip(betahat[i]), xx)
    plots.append(plt.plot(xx, yy)[0])

plt.xlabel(r'$u$')
plt.ylabel(r'$h^{\mathcal{H}_p}_{\tau}(u)$')
plt.legend(plots, ('data points', 'true', '$p=2$, underfit',
                   '$p=4$, correct', '$p=16$, overfit'))
plt.savefig('polyfitpy.pdf', format='pdf')
plt.show()





The last code snippet, which imports the previous code, generates the test data and plots the
graph of the test loss, as shown in Figure 2.7.

polyreg3.py

from polyreg2 import *

# generate test data
u_test, y_test = generate_data(beta, sig, n)

MSE = []
X_test = np.ones((n, 1))

for p in p_range:
    if p > 1:
        X_test = np.hstack((X_test, u_test**(p-1)))

    y_hat = X_test @ betahat[p]  # predictions
    MSE.append(np.sum((y_test - y_hat)**2/n))

plt.plot(p_range, MSE, 'b', p_range, MSE, 'bo')
plt.xticks(ticks=p_range)
plt.xlabel('Number of parameters $p$')
plt.ylabel('Test loss')





2.4 Tradeoffs in Statistical Learning 


The art of machine learning in the supervised case is to make the generalization risk (2.5)
or expected generalization risk (2.6) as small as possible, while using as few computational
resources as possible. In pursuing this goal, a suitable class $\mathcal{G}$ of prediction functions has
to be chosen. This choice is driven by various factors, such as

• the complexity of the class (e.g., is it rich enough to adequately approximate, or even
contain, the optimal prediction function $g^*$?),
• the ease of training the learner via the optimization program (2.4),
• how accurately the training loss (2.3) estimates the risk (2.1) within class $\mathcal{G}$,
• the feature types (categorical, continuous, etc.).

As a result, the choice of a suitable function class $\mathcal{G}$ usually involves a tradeoff between
conflicting factors. For example, a learner from a simple class $\mathcal{G}$ can be trained very
quickly, but may not approximate $g^*$ very well, whereas a learner from a rich class $\mathcal{G}$
that contains $g^*$ may require a lot of computing resources to train.

To better understand the relation between model complexity, computational simplicity,
and estimation accuracy, it is useful to decompose the generalization risk into several parts,
so that the tradeoffs between these parts can be studied. We will consider two such
decompositions: the approximation-estimation tradeoff and the bias-variance tradeoff.

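We can decompose the generalization risk (2.5) into the following three components:

$$\ell(g_\tau^{\mathcal{G}}) = \underbrace{\ell^*}_{\text{irreducible risk}} + \underbrace{\ell(g^{\mathcal{G}}) - \ell^*}_{\text{approximation error}} + \underbrace{\ell(g_\tau^{\mathcal{G}}) - \ell(g^{\mathcal{G}})}_{\text{statistical error}}, \qquad (2.16)$$

where $\ell^* := \ell(g^*)$ is the irreducible risk and $g^{\mathcal{G}} := \operatorname{argmin}_{g \in \mathcal{G}} \ell(g)$ is the best learner within
class $\mathcal{G}$. No learner can predict a new response with a smaller risk than $\ell^*$.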

The second component is the approximation error; it measures the difference between
the irreducible risk and the best possible risk that can be obtained by selecting the best
prediction function in the selected class of functions $\mathcal{G}$. Determining a suitable class $\mathcal{G}$ and
minimizing $\ell(g)$ over this class is purely a problem of numerical and functional analysis,
as the training data $\tau$ are not present. For a fixed $\mathcal{G}$ that does not contain the optimal $g^*$, the
approximation error cannot be made arbitrarily small and may be the dominant component
in the generalization risk. The only way to reduce the approximation error is by expanding
the class $\mathcal{G}$ to include a larger set of possible functions.

The third component is the statistical (estimation) error. It depends on the training
set $\tau$ and, in particular, on how well the learner $g_\tau^{\mathcal{G}}$ estimates the best possible prediction
function, $g^{\mathcal{G}}$, within class $\mathcal{G}$. For any sensible estimator this error should decay to zero (in
probability or expectation) as the training size tends to infinity.

The approximation-estimation tradeoff pits two competing demands against each
other. The first is that the class $\mathcal{G}$ has to be simple enough so that the statistical error is
not too large. The second is that the class $\mathcal{G}$ has to be rich enough to ensure a small
approximation error. Thus, there is a tradeoff between the approximation and estimation errors.

For the special case of the squared-error loss, the generalization risk is equal to $\ell(g_\tau^{\mathcal{G}}) =
\mathbb{E}(Y - g_\tau^{\mathcal{G}}(\boldsymbol{X}))^2$; that is, the expected squared error$^1$ between the predicted value $g_\tau^{\mathcal{G}}(\boldsymbol{X})$
and the response $Y$. Recall that in this case the optimal prediction function is given by
$g^*(\mathbf{x}) = \mathbb{E}[Y \mid \boldsymbol{X} = \mathbf{x}]$. The decomposition (2.16) can now be interpreted as follows.


1. The first component, $\ell^* = \mathbb{E}(Y - g^*(\boldsymbol{X}))^2$, is the irreducible error, as no prediction
function will yield a smaller expected squared error.

2. The second component, the approximation error $\ell(g^{\mathcal{G}}) - \ell(g^*)$, is equal to
$\mathbb{E}(g^{\mathcal{G}}(\boldsymbol{X}) - g^*(\boldsymbol{X}))^2$. We leave the proof (which is similar to that of Theorem 2.1) as an exercise;
see Exercise 2. Thus, the approximation error (defined as a risk difference) can here
be interpreted as the expected squared error between the optimal predicted value and
the optimal predicted value within the class $\mathcal{G}$.

3. For the third component, the statistical error $\ell(g_\tau^{\mathcal{G}}) - \ell(g^{\mathcal{G}})$, there is no direct
interpretation as an expected squared error unless $\mathcal{G}$ is the class of linear functions; that
is, $g(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}$ for some vector $\boldsymbol{\beta}$. In this case we can write (see Exercise 3) the
statistical error as $\ell(g_\tau^{\mathcal{G}}) - \ell(g^{\mathcal{G}}) = \mathbb{E}(g_\tau^{\mathcal{G}}(\boldsymbol{X}) - g^{\mathcal{G}}(\boldsymbol{X}))^2$.





$^1$ Colloquially called mean squared error.
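Thus, when using a squared-error loss, the generalization risk for a linear class $\mathcal{G}$ can
be decomposed as:

$$\ell(g_\tau^{\mathcal{G}}) = \mathbb{E}(g_\tau^{\mathcal{G}}(\boldsymbol{X}) - Y)^2 = \ell^* + \underbrace{\mathbb{E}(g^{\mathcal{G}}(\boldsymbol{X}) - g^*(\boldsymbol{X}))^2}_{\text{approximation error}} + \underbrace{\mathbb{E}(g_\tau^{\mathcal{G}}(\boldsymbol{X}) - g^{\mathcal{G}}(\boldsymbol{X}))^2}_{\text{statistical error}}. \qquad (2.17)$$

Note that in this decomposition the statistical error is the only term that depends on the
training set.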


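Example 2.2 (Polynomial Regression (cont.)) We continue Example 2.1. Here $\mathcal{G} =
\mathcal{G}_p$ is the class of linear functions of $\mathbf{x} = [1, u, u^2, \ldots, u^{p-1}]^\top$, and $g^*(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}^*$.
Conditional on $\boldsymbol{X} = \mathbf{x}$ we have that $Y = g^*(\mathbf{x}) + \varepsilon(\mathbf{x})$, with $\varepsilon(\mathbf{x}) \sim \mathcal{N}(0, \ell^*)$, where
$\ell^* = \mathbb{E}(Y - g^*(\boldsymbol{X}))^2 = 25$ is the irreducible error. We wish to understand how the approximation and
statistical errors behave as we change the complexity parameter $p$.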

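First, we consider the approximation error. Any function $g \in \mathcal{G}_p$ can be written as

$$g(\mathbf{x}) = h(u) = \beta_1 + \beta_2 u + \cdots + \beta_p u^{p-1} = [1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta},$$

and so $g(\boldsymbol{X})$ is distributed as $[1, U, \ldots, U^{p-1}]\, \boldsymbol{\beta}$, where $U \sim \mathcal{U}(0, 1)$. Similarly, $g^*(\boldsymbol{X})$ is
distributed as $[1, U, U^2, U^3]\, \boldsymbol{\beta}^*$. It follows that an expression for the approximation error
is: $\int_0^1 \left([1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta} - [1, u, u^2, u^3]\, \boldsymbol{\beta}^*\right)^2 du$. To minimize this error, we set the gradient
with respect to $\boldsymbol{\beta}$ to zero and obtain the $p$ linear equations

$$\begin{aligned}
\int_0^1 \left([1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta} - [1, u, u^2, u^3]\, \boldsymbol{\beta}^*\right) du &= 0, \\
\int_0^1 \left([1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta} - [1, u, u^2, u^3]\, \boldsymbol{\beta}^*\right) u \, du &= 0, \\
&\;\;\vdots \\
\int_0^1 \left([1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta} - [1, u, u^2, u^3]\, \boldsymbol{\beta}^*\right) u^{p-1} \, du &= 0.
\end{aligned}$$

Let

$$\mathbf{H}_p = \int_0^1 [1, u, \ldots, u^{p-1}]^\top [1, u, \ldots, u^{p-1}] \, du$$

be the $p \times p$ Hilbert matrix, which has $(i, j)$-th entry given by $\int_0^1 u^{i+j-2} \, du = 1/(i + j - 1)$.
Then, the above system of linear equations can be written as $\mathbf{H}_p \boldsymbol{\beta} = \widetilde{\mathbf{H}} \boldsymbol{\beta}^*$, where $\widetilde{\mathbf{H}}$ is the
$p \times 4$ upper left sub-block of $\mathbf{H}_{\tilde{p}}$ and $\tilde{p} = \max\{p, 4\}$. The solution, which we denote by $\boldsymbol{\beta}_p$,
is:

$$\boldsymbol{\beta}_p = \begin{cases}
\frac{65}{6}, & p = 1, \\
\left[-\frac{20}{3}, 35\right]^\top, & p = 2, \\
\left[-\frac{5}{2}, 10, 25\right]^\top, & p = 3, \\
[10, -140, 400, -250, 0, \ldots, 0]^\top, & p \geq 4.
\end{cases} \qquad (2.18)$$

Hence, the approximation error $\mathbb{E}\left(g^{\mathcal{G}_p}(\boldsymbol{X}) - g^*(\boldsymbol{X})\right)^2$ is given by

$$\int_0^1 \left([1, u, \ldots, u^{p-1}]\, \boldsymbol{\beta}_p - [1, u, u^2, u^3]\, \boldsymbol{\beta}^*\right)^2 du = \begin{cases}
\frac{32225}{252} \approx 127.9, & p = 1, \\
\frac{1625}{63} \approx 25.8, & p = 2, \\
\frac{625}{28} \approx 22.3, & p = 3, \\
0, & p \geq 4.
\end{cases} \qquad (2.19)$$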
Notice how the approximation error becomes smaller as $p$ increases. In this particular
example the approximation error is in fact zero for $p \geq 4$. In general, as the class of
approximating functions $\mathcal{G}$ becomes more complex, the approximation error goes down.
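The values in (2.18) and (2.19) can be reproduced numerically from the Hilbert matrix. The following is a minimal sketch; the helper names `hilbert`, `best_in_class`, and `approximation_error` are hypothetical and chosen only for this illustration.

```python
import numpy as np

beta_star = np.array([10.0, -140.0, 400.0, -250.0])

def hilbert(rows, cols):
    """rows x cols block with (i, j) entry 1/(i + j - 1), using 1-based i, j."""
    i = np.arange(1, rows + 1)[:, None]
    j = np.arange(1, cols + 1)[None, :]
    return 1.0 / (i + j - 1)

def best_in_class(p):
    """Solve H_p beta = H~ beta*, with H~ the p x 4 upper left block."""
    return np.linalg.solve(hilbert(p, p), hilbert(p, 4) @ beta_star)

def approximation_error(p):
    """Integral over (0, 1) of the squared difference of the two polynomials."""
    m = max(p, 4)
    d = np.zeros(m)                 # coefficients of the difference polynomial
    d[:p] += best_in_class(p)
    d[:4] -= beta_star
    return d @ hilbert(m, m) @ d

errors = {p: approximation_error(p) for p in range(1, 7)}
```

Here the quadratic form `d @ hilbert(m, m) @ d` evaluates the integral of the squared difference exactly, because the integral of $u^{i+j-2}$ over (0, 1) is the corresponding Hilbert matrix entry.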

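Next, we illustrate the typical behavior of the statistical error. Since $g_\tau^{\mathcal{G}_p}(\mathbf{x}) = \mathbf{x}^\top \hat{\boldsymbol{\beta}}$, the
statistical error can be written as

$$\int_0^1 \left([1, u, \ldots, u^{p-1}](\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_p)\right)^2 du = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_p)^\top \mathbf{H}_p (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_p). \qquad (2.20)$$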
0 


Figure 2.8 illustrates the decomposition (2.17) of the generalization risk for the same train- 
ing set that was used to compute the test loss in Figure 2.7. Recall that test loss gives an 
estimate of the generalization risk, using independent test data. Comparing the two figures, 
we see that in this case the two match closely. The global minimum of the statistical error is 
approximately 0.28, with minimizer p = 4. Since the approximation error is monotonically 
decreasing to zero, p = 4 is also the global minimizer of the generalization risk. 
Figure 2.8: The generalization risk for a particular training set is the sum of the irreducible
error, the approximation error, and the statistical error. The approximation error decreases 
to zero as p increases, whereas the statistical error has a tendency to increase after p = 4. 


Note that the statistical error depends on the estimate $\hat{\boldsymbol{\beta}}$, which in its turn depends on
the training set $\tau$. We can obtain a better understanding of the statistical error by
considering its expected behavior; that is, averaged over many training sets. This is explored in
Exercise 11.


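Using again a squared-error loss, a second decomposition (for general $\mathcal{G}$) starts from

$$\ell(g_\tau^{\mathcal{G}}) = \ell^* + \ell(g_\tau^{\mathcal{G}}) - \ell^*,$$

where the statistical error and approximation error are combined. Using similar reasoning
as in the proof of Theorem 2.1, we have

$$\ell(g_\tau^{\mathcal{G}}) = \mathbb{E}(g_\tau^{\mathcal{G}}(\boldsymbol{X}) - Y)^2 = \ell^* + \mathbb{E}\left(g_\tau^{\mathcal{G}}(\boldsymbol{X}) - g^*(\boldsymbol{X})\right)^2 = \ell^* + \mathbb{E} D^2(\boldsymbol{X}, \tau),$$

where $D(\mathbf{x}, \tau) := g_\tau^{\mathcal{G}}(\mathbf{x}) - g^*(\mathbf{x})$. Now consider the random variable $D(\mathbf{x}, \mathcal{T})$ for a random
training set $\mathcal{T}$. The expectation of its square is:

$$\mathbb{E}\left(g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x}) - g^*(\mathbf{x})\right)^2 = \mathbb{E} D^2(\mathbf{x}, \mathcal{T}) = (\mathbb{E} D(\mathbf{x}, \mathcal{T}))^2 + \operatorname{Var} D(\mathbf{x}, \mathcal{T})
= \underbrace{\left(\mathbb{E} g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x}) - g^*(\mathbf{x})\right)^2}_{\text{pointwise squared bias}} + \underbrace{\operatorname{Var} g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x})}_{\text{pointwise variance}}. \qquad (2.21)$$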

If we view the learner $g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x})$ as a function of a random training set, then the pointwise
squared bias term is a measure for how close $g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x})$ is on average to the true $g^*(\mathbf{x})$,
whereas the pointwise variance term measures the deviation of $g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x})$ from its expected
value $\mathbb{E} g_{\mathcal{T}}^{\mathcal{G}}(\mathbf{x})$. The squared bias can be reduced by making the class of functions $\mathcal{G}$ more
complex. However, decreasing the bias by increasing the complexity often leads to an
increase in the variance term. We are thus seeking learners that provide an optimal balance
between the bias and variance, as expressed via a minimal generalization risk. This is called
the bias-variance tradeoff.
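The pointwise squared bias and variance in (2.21) can be approximated by simulation for the polynomial regression example. The sketch below is an illustration under assumptions: the evaluation point $u_0 = 0.5$, the number of replications, the seed, and the function name `fit_and_predict` are all chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_star = np.array([10.0, -140.0, 400.0, -250.0])
n, sigma, u0, reps = 100, 5.0, 0.5, 2000

def fit_and_predict(p):
    """Draw one random training set, fit by least squares, predict at u0."""
    u = rng.random(n)
    y = np.vander(u, 4, increasing=True) @ beta_star + sigma * rng.standard_normal(n)
    X = np.vander(u, p, increasing=True)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.vander([u0], p, increasing=True)[0] @ beta_hat

g_star = np.vander([u0], 4, increasing=True)[0] @ beta_star  # true g*(x) at u0

results = {}
for p in (2, 4, 8):
    preds = np.array([fit_and_predict(p) for _ in range(reps)])
    # (estimated pointwise squared bias, estimated pointwise variance)
    results[p] = ((preds.mean() - g_star) ** 2, preds.var())
```

In such a simulation one would expect the squared bias to be clearly positive for the too-simple class ($p = 2$) and close to zero once the class contains $g^*$ ($p \geq 4$), while the variance grows when the class is enlarged from $p = 4$ to $p = 8$.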

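Note that the expected generalization risk (2.6) can be written as $\ell^* + \mathbb{E} D^2(\boldsymbol{X}, \mathcal{T})$, where
$\boldsymbol{X}$ and $\mathcal{T}$ are independent. It therefore decomposes as

$$\mathbb{E}\, \ell(g_{\mathcal{T}}^{\mathcal{G}}) = \ell^* + \underbrace{\mathbb{E}\left(\mathbb{E}[g_{\mathcal{T}}^{\mathcal{G}}(\boldsymbol{X}) \mid \boldsymbol{X}] - g^*(\boldsymbol{X})\right)^2}_{\text{expected squared bias}} + \underbrace{\mathbb{E}\left[\operatorname{Var}[g_{\mathcal{T}}^{\mathcal{G}}(\boldsymbol{X}) \mid \boldsymbol{X}]\right]}_{\text{expected variance}}. \qquad (2.22)$$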


2.5 Estimating Risk 


The most straightforward way to quantify the generalization risk (2.5) is to estimate it via 
the test loss (2.7). However, the generalization risk depends inherently on the training set, 
and so different training sets may yield significantly different estimates. Moreover, when 
there is a limited amount of data available, reserving a substantial proportion of the data 
for testing rather than training may be uneconomical. In this section we consider different 
methods for estimating risk measures which aim to circumvent these difficulties. 


2.5.1 In-Sample Risk 


We mentioned that, due to the phenomenon of overfitting, the training loss of the learner,
$\ell_\tau(g_\tau)$ (for simplicity, here we omit $\mathcal{G}$ from $g_\tau^{\mathcal{G}}$), is not a good estimate of the generalization
risk $\ell(g_\tau)$ of the learner. One reason for this is that we use the same data for both training
the model and assessing its risk. How should we then estimate the generalization risk or
expected generalization risk?

To simplify the analysis, suppose that we wish to estimate the average accuracy of the
predictions of the learner $g_\tau$ at the $n$ feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ (these are part of the training
set $\tau$). In other words, we wish to estimate the in-sample risk of the learner $g_\tau$:

$$\ell_{\mathrm{in}}(g_\tau) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\, \mathrm{Loss}(Y_i', g_\tau(\mathbf{x}_i)), \qquad (2.23)$$

where each response $Y_i'$ is drawn from $f(y \mid \mathbf{x}_i)$, independently. Even in this simplified
setting, the training loss of the learner will be a poor estimate of the in-sample risk. Instead, the
proper way to assess the prediction accuracy of the learner at the feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$
is to draw new response values $Y_i' \sim f(y \mid \mathbf{x}_i)$, $i = 1, \ldots, n$, that are independent from the
responses $y_1, \ldots, y_n$ in the training data, and then estimate the in-sample risk of $g_\tau$ via

$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(Y_i', g_\tau(\mathbf{x}_i)).$$


For a fixed training set $\tau$, we can compare the training loss of the learner with the
in-sample risk. Their difference,

$$\mathrm{op}_\tau = \ell_{\mathrm{in}}(g_\tau) - \ell_\tau(g_\tau),$$

is called the optimism (of the training loss), because it measures how much the training
loss underestimates (is optimistic about) the unknown in-sample risk. Mathematically, it is
simpler to work with the expected optimism:

$$\mathbb{E}[\mathrm{op}_{\mathcal{T}} \mid \boldsymbol{X}_1 = \mathbf{x}_1, \ldots, \boldsymbol{X}_n = \mathbf{x}_n] =: \mathbb{E}_{\mathbf{X}}\, \mathrm{op}_{\mathcal{T}},$$

where the expectation is taken over a random training set $\mathcal{T}$, conditional on $\boldsymbol{X}_i = \mathbf{x}_i$,
$i = 1, \ldots, n$. For ease of notation, we have abbreviated the expected optimism to $\mathbb{E}_{\mathbf{X}}\, \mathrm{op}_{\mathcal{T}}$,
where $\mathbb{E}_{\mathbf{X}}$ denotes the expectation operator conditional on $\boldsymbol{X}_i = \mathbf{x}_i$, $i = 1, \ldots, n$. As in
Example 2.1, the feature vectors are stored as the rows of an $n \times p$ matrix $\mathbf{X}$. It turns out that the
expected optimism for various loss functions can be expressed in terms of the (conditional)
covariance between the observed and predicted response.


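Theorem 2.2: Expected Optimism

For the squared-error loss and 0-1 loss with 0-1 response, the expected optimism is

$$\mathbb{E}_{\mathbf{X}}\, \mathrm{op}_{\mathcal{T}} = \frac{2}{n} \sum_{i=1}^{n} \operatorname{Cov}_{\mathbf{X}}(g_{\mathcal{T}}(\mathbf{x}_i), Y_i). \qquad (2.24)$$
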
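Proof: In what follows, all expectations are taken conditional on $\boldsymbol{X}_1 = \mathbf{x}_1, \ldots, \boldsymbol{X}_n = \mathbf{x}_n$.
Let $Y_i$ be the response for $\mathbf{x}_i$ and let $\widehat{Y}_i = g_{\mathcal{T}}(\mathbf{x}_i)$ be the predicted value. Note that the latter
depends on $Y_1, \ldots, Y_n$. Also, let $Y_i'$ be an independent copy of $Y_i$ for the same $\mathbf{x}_i$, as in
(2.23). In particular, $Y_i'$ has the same distribution as $Y_i$ and is statistically independent of
all $\{Y_j\}$, including $Y_i$, and therefore is also independent of $\widehat{Y}_i$. We have

$$\begin{aligned}
\mathbb{E}_{\mathbf{X}}\, \mathrm{op}_{\mathcal{T}} &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\mathbf{X}}\left[(Y_i' - \widehat{Y}_i)^2 - (Y_i - \widehat{Y}_i)^2\right] = \frac{2}{n} \sum_{i=1}^{n} \mathbb{E}_{\mathbf{X}}\left[(Y_i - Y_i')\, \widehat{Y}_i\right] \\
&= \frac{2}{n} \sum_{i=1}^{n} \left(\mathbb{E}_{\mathbf{X}}[Y_i \widehat{Y}_i] - \mathbb{E}_{\mathbf{X}} Y_i\, \mathbb{E}_{\mathbf{X}} \widehat{Y}_i\right) = \frac{2}{n} \sum_{i=1}^{n} \operatorname{Cov}_{\mathbf{X}}(\widehat{Y}_i, Y_i).
\end{aligned}$$

The proof for the 0-1 loss with 0-1 response is left as Exercise 4. □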


In summary, the expected optimism indicates how much, on average, the training loss
deviates from the expected in-sample risk. Since the covariance of independent random
variables is zero, the expected optimism is zero if the learner $g_{\mathcal{T}}$ is statistically independent
from the responses $Y_1, \ldots, Y_n$.
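Example 2.3 (Polynomial Regression (cont.)) We continue Example 2.2, where the
components of the response vector $\boldsymbol{Y} = [Y_1, \ldots, Y_n]^\top$ are independent and normally
distributed with variance $\ell^* = 25$ (the irreducible error) and expectations $\mathbb{E} Y_i = g^*(\mathbf{x}_i) = \mathbf{x}_i^\top \boldsymbol{\beta}^*$,
$i = 1, \ldots, n$. Using the formula (2.15) for the least-squares estimator $\hat{\boldsymbol{\beta}}$, the expected
optimism (2.24) is

$$\frac{2}{n} \sum_{i=1}^{n} \operatorname{Cov}_{\mathbf{X}}(\mathbf{x}_i^\top \hat{\boldsymbol{\beta}}, Y_i) = \frac{2}{n} \operatorname{tr}\left(\operatorname{Cov}_{\mathbf{X}}(\mathbf{X}\hat{\boldsymbol{\beta}}, \boldsymbol{Y})\right) = \frac{2}{n} \operatorname{tr}\left(\operatorname{Cov}_{\mathbf{X}}(\mathbf{X}\mathbf{X}^+ \boldsymbol{Y}, \boldsymbol{Y})\right)
= \frac{2 \operatorname{tr}\left(\mathbf{X}\mathbf{X}^+ \operatorname{Cov}_{\mathbf{X}}(\boldsymbol{Y}, \boldsymbol{Y})\right)}{n} = \frac{2 \ell^* \operatorname{tr}(\mathbf{X}\mathbf{X}^+)}{n} = \frac{2 \ell^* p}{n}.$$

In the last equation we used the cyclic property of the trace (Theorem A.1): $\operatorname{tr}(\mathbf{X}\mathbf{X}^+) =
\operatorname{tr}(\mathbf{X}^+\mathbf{X}) = \operatorname{tr}(\mathbf{I}_p)$, assuming that $\operatorname{rank}(\mathbf{X}) = p$. Therefore, an estimate for the in-sample risk
(2.23) is:

$$\hat{\ell}_{\mathrm{in}}(g_\tau) = \ell_\tau(g_\tau) + 2 \ell^* p / n, \qquad (2.25)$$

where we have assumed that the irreducible risk $\ell^*$ is known. Figure 2.9 shows that this
estimate is very close to the test loss from Figure 2.7. Hence, instead of computing the test
loss to assess the best model complexity $p$, we could simply have minimized the training
loss plus the correction term $2 \ell^* p / n$. In practice, $\ell^*$ also has to be estimated somehow.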
Figure 2.9: In-sample risk estimate $\hat{\ell}_{\mathrm{in}}(g_\tau)$ as a function of the number of parameters $p$ of
the model. The test loss is superimposed as a blue dashed curve.
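The expected optimism $2 \ell^* p / n$ can be checked by Monte Carlo simulation with the feature vectors held fixed. The sketch below assumes $p = 6$, a fixed simulated design, and 20000 replications; these choices and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
beta_star = np.array([10.0, -140.0, 400.0, -250.0])
n, p, ell_star, reps = 100, 6, 25.0, 20000

u = rng.random(n)                                  # fixed feature values
X = np.vander(u, p, increasing=True)               # fixed n x p model matrix
mean_y = np.vander(u, 4, increasing=True) @ beta_star
P = X @ np.linalg.pinv(X)                          # projection matrix (2.13)

noise = np.sqrt(ell_star) * rng.standard_normal((reps, 2, n))
Y = mean_y + noise[:, 0, :]       # training responses, one row per replication
Y_new = mean_y + noise[:, 1, :]   # independent copies at the same features
Y_hat = Y @ P                     # fitted values; P is symmetric, so row r is P Y_r

train_loss = np.mean((Y - Y_hat) ** 2, axis=1)
in_sample = np.mean((Y_new - Y_hat) ** 2, axis=1)
optimism_mc = np.mean(in_sample - train_loss)      # Monte Carlo expected optimism
optimism_theory = 2 * ell_star * p / n             # value from Example 2.3
```

With these settings the theoretical value is $2 \cdot 25 \cdot 6 / 100 = 3$, and the simulated average of the in-sample loss minus the training loss should be close to it.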


2.5.2 Cross-Validation 


In general, for complex function classes $\mathcal{G}$, it is very difficult to derive simple formulas of
the approximation and statistical errors, let alone for the generalization risk or expected
generalization risk. As we saw, when there is an abundance of data, the easiest way to
assess the generalization risk for a given training set $\tau$ is to obtain a test set $\tau'$ and evaluate
the test loss (2.7). When a sufficiently large test set is not available but computational
resources are cheap, one can instead gain direct knowledge of the expected generalization
risk via a computationally intensive method called cross-validation.


The idea is to make multiple identical copies of the data set, and to partition each copy 
into different training and test sets, as illustrated in Figure 2.10. Here, there are four copies 
of the data set (consisting of response and explanatory variables). Each copy is divided into 
a test set (colored blue) and training set (colored pink). For each of these sets, we estimate 
the model parameters using only training data and then predict the responses for the test 
set. The average loss between the predicted and observed responses is then a measure for 
the predictive power of the model. 



Figure 2.10: An illustration of four-fold cross-validation, representing four copies of the 
same data set. The data in each copy is partitioned into a training set (pink) and a test 
set (blue). The darker columns represent the response variable and the lighter ones the 
explanatory variables. 


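In particular, suppose we partition a data set $\mathcal{T}$ of size $n$ into $K$ folds $C_1, \ldots, C_K$ of sizes
$n_1, \ldots, n_K$ (hence, $n_1 + \cdots + n_K = n$). Typically $n_k \approx n/K$, $k = 1, \ldots, K$.

Let $\ell_{C_k}$ be the test loss when using $C_k$ as test data and all remaining data, denoted $\mathcal{T}_{-k}$,
as training data. Each $\ell_{C_k}$ is an unbiased estimator of the generalization risk for training set
$\mathcal{T}_{-k}$; that is, for $\ell(g_{\mathcal{T}_{-k}})$.

The K-fold cross-validation loss is the weighted average of these risk estimators:

$$\begin{aligned}
\mathrm{CV}_K &= \sum_{k=1}^{K} \frac{n_k}{n}\, \ell_{C_k}(g_{\mathcal{T}_{-k}}) \\
&= \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k} \mathrm{Loss}(g_{\mathcal{T}_{-k}}(\mathbf{x}_i), y_i) \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(g_{\mathcal{T}_{-\kappa(i)}}(\mathbf{x}_i), y_i),
\end{aligned}$$

where the function $\kappa : \{1, \ldots, n\} \mapsto \{1, \ldots, K\}$ indicates to which of the $K$ folds each
of the $n$ observations belongs. As the average is taken over varying training sets $\{\mathcal{T}_{-k}\}$, it
estimates the expected generalization risk $\mathbb{E}\, \ell(g_{\mathcal{T}})$, rather than the generalization risk $\ell(g_\tau)$
for the particular training set $\tau$.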
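The last expression for the cross-validation loss translates directly into code once the fold map $\kappa$ is available. The following sketch uses a random partition and a least-squares learner with squared-error loss; the function name `kfold_cv_loss`, the simulated data, and the seeds are assumptions for illustration.

```python
import numpy as np

def kfold_cv_loss(X, y, K, seed=0):
    """K-fold cross-validation loss for least squares with squared-error loss."""
    rng = np.random.default_rng(seed)
    n = len(y)
    kappa = rng.permutation(n) % K       # fold label kappa(i) in {0, ..., K-1}
    total = 0.0
    for k in range(K):
        test = kappa == k                # boolean mask of fold C_k
        # train on all data outside fold k
        beta_hat = np.linalg.lstsq(X[~test], y[~test], rcond=None)[0]
        total += np.sum((y[test] - X[test] @ beta_hat) ** 2)
    return total / n

rng = np.random.default_rng(2)
u = rng.random(100)
y = 10 - 140 * u + 400 * u**2 - 250 * u**3 + 5 * rng.standard_normal(100)
cv = {p: kfold_cv_loss(np.vander(u, p, increasing=True), y, K=10) for p in (2, 4, 8)}
```

Assigning labels via a random permutation modulo $K$ gives folds whose sizes differ by at most one, so that $n_k \approx n/K$.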


Example 2.4 (Polynomial Regression (cont.)) For the polynomial regression
example, we can calculate a K-fold cross-validation loss with a nonrandom partitioning of the
training set using the following code, which imports the previous code for the polynomial
regression example. We omit the full plotting code.

polyregCV.py

from polyreg3 import *

K_vals = [5, 10, 100]  # number of folds
cv = np.zeros((len(K_vals), max_p))  # cv loss
X = np.ones((n, 1))

for p in p_range:
    if p > 1:
        X = np.hstack((X, u**(p-1)))
    j = 0
    for K in K_vals:
        loss = []
        for k in range(1, K+1):
            # integer indices of test samples
            test_ind = ((n/K)*(k-1) + np.arange(1, n/K+1) - 1).astype('int')
            train_ind = np.setdiff1d(np.arange(n), test_ind)

            X_train, y_train = X[train_ind, :], y[train_ind, :]
            X_test, y_test = X[test_ind, :], y[test_ind]

            # fit model and evaluate test loss
            betahat = solve(X_train.T @ X_train, X_train.T @ y_train)
            loss.append(norm(y_test - X_test @ betahat) ** 2)

        cv[j, p-1] = sum(loss)/n
        j += 1

# basic plotting
plt.plot(p_range, cv[0, :])
plt.plot(p_range, cv[1, :])
plt.plot(p_range, cv[2, :])
plt.show()
Figure 2.11: K-fold cross-validation for the polynomial regression example.
Figure 2.11 shows the cross-validation loss for $K \in \{5, 10, 100\}$. The case $K = 100$
corresponds to the leave-one-out cross-validation, which can be computed more efficiently
using the formula in Theorem 5.1.
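Theorem 5.1 is not reproduced in this section. As an illustration of why leave-one-out cross-validation need not require $n$ separate fits, the sketch below uses the standard identity for linear least squares, in which each leave-one-out residual equals the ordinary residual divided by $1 - P_{ii}$, with $P_{ii}$ the $i$-th diagonal element of the projection matrix. That this is the formula meant by Theorem 5.1 is an assumption; the data and names are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 4
u = rng.random(n)
X = np.vander(u, p, increasing=True)
y = X @ np.array([10.0, -140.0, 400.0, -250.0]) + 5 * rng.standard_normal(n)

# Brute force: refit n times, each time leaving one observation out.
loo_brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_brute += (y[i] - X[i] @ beta_i) ** 2
loo_brute /= n

# Shortcut: a single fit, with residuals rescaled by 1 - P_ii.
P = X @ np.linalg.pinv(X)
residuals = y - P @ y
loo_fast = np.mean((residuals / (1 - np.diag(P))) ** 2)
```

Both computations give the same leave-one-out loss up to rounding error, but the shortcut needs only one least-squares fit.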


2.6 Modeling Data 


The first step in any data analysis is to model the data in one form or another. For example,
in an unsupervised learning setting with data represented by a vector $\mathbf{x} = [x_1, \ldots, x_p]^\top$, a
very general model is to assume that $\mathbf{x}$ is the outcome of a random vector $\boldsymbol{X} = [X_1, \ldots, X_p]^\top$
with some unknown pdf $f$. The model can then be refined by assuming a specific form of
$f$.

When given a sequence of such data vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$, one of the simplest models is to
assume that the corresponding random vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ are independent and identically
distributed (iid). We write

$$\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n \overset{\text{iid}}{\sim} f \quad \text{or} \quad \boldsymbol{X}_1, \ldots, \boldsymbol{X}_n \overset{\text{iid}}{\sim} \mathsf{Dist},$$

to indicate that the random vectors form an iid sample from a sampling pdf $f$ or sampling
distribution $\mathsf{Dist}$. This model formalizes the notion that the knowledge about one variable
does not provide extra information about another variable. The main theoretical use of
independent data models is that the joint density of the random vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ is simply
the product of the marginal ones; see Theorem C.1. Specifically,

$$f_{\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n}(\mathbf{x}_1, \ldots, \mathbf{x}_n) = f(\mathbf{x}_1) \cdots f(\mathbf{x}_n).$$

In most models of this kind, our approximation or model for the sampling distribution is
specified up to a small number of parameters. That is, $g(\mathbf{x})$ is of the form $g(\mathbf{x} \mid \boldsymbol{\beta})$ which
is known up to some parameter vector $\boldsymbol{\beta}$. Examples for the one-dimensional case ($p = 1$)
include the $\mathcal{N}(\mu, \sigma^2)$, $\mathsf{Bin}(n, p)$, and $\mathsf{Exp}(\lambda)$ distributions. See Tables C.1 and C.2 for other
common sampling distributions.

Typically, the parameters are unknown and must be estimated from the data. In a
nonparametric setting the whole sampling distribution would be unknown. To visualize the
underlying sampling distribution from outcomes $\mathbf{x}_1, \ldots, \mathbf{x}_n$ one can use graphical
representations such as histograms, density plots, and empirical cumulative distribution
functions, as discussed in Chapter 1.

If the order in which the data were collected (or their labeling) is not informative or
relevant, then the joint pdf of $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ satisfies the symmetry:

$$f_{\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n}(\mathbf{x}_1, \ldots, \mathbf{x}_n) = f_{\boldsymbol{X}_{\pi_1}, \ldots, \boldsymbol{X}_{\pi_n}}(\mathbf{x}_{\pi_1}, \ldots, \mathbf{x}_{\pi_n}) \qquad (2.26)$$

for any permutation $\pi_1, \ldots, \pi_n$ of the integers $1, \ldots, n$. We say that the infinite sequence
$\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots$ is exchangeable if this permutational invariance (2.26) holds for any finite subset
of the sequence. As we shall see in Section 2.9 on Bayesian learning, it is common to
assume that the random vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ are a subset of an exchangeable sequence and
thus satisfy (2.26). Note that while iid random variables are exchangeable, the converse is
not necessarily true. Thus, the assumption of an exchangeable sequence of random vectors
is weaker than the assumption of iid random vectors.


Figure 2.12 illustrates the modeling tradeoffs. The keywords within the triangle repres- 
ent various modeling paradigms. A few keywords have been highlighted, symbolizing their 
importance in modeling. The specific meaning of the keywords does not concern us here, 
but the point is there are many models to choose from, depending on what assumptions are 
made about the data. 
Figure 2.12: Illustration of the modeling dilemma. Complex models are more generally
applicable, but may be difficult to analyze. Simple models may be highly tractable, but 
may not describe the data accurately. The triangular shape signifies that there are a great 
many specific models but not so many generic ones. 


On the one hand, models that make few assumptions are more widely applicable, but at 
the same time may not be very mathematically tractable or provide insight into the nature 
of the data. On the other hand, very specific models may be easy to handle and interpret, but 
may not match the data very well. This tradeoff between the tractability and applicability of 
the model is very similar to the approximation—estimation tradeoff described in Section 2.4. 

In the typical unsupervised setting we have a training set $\tau = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_n\}$ that is viewed as the outcome of $n$ iid random variables $\boldsymbol{X}_1,\ldots,\boldsymbol{X}_n$ from some unknown pdf $f$. The objective is then to learn or estimate $f$ from the finite training data. To put the learning in a similar framework as for supervised learning discussed in the preceding Sections 2.3-2.5, we begin by specifying a class of probability density functions $\mathcal{G}_p := \{g(\cdot \mid \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta\}$, where $\boldsymbol{\theta}$ is a parameter in some subset $\Theta$ of $\mathbb{R}^p$. We now seek the best $g$ in $\mathcal{G}_p$ to minimize some risk. Note that $\mathcal{G}_p$ may not necessarily contain the true $f$ even for very large $p$.


We stress that our notation $g(\boldsymbol{x})$ has a different meaning in the supervised and unsupervised case. In the supervised case, $g$ is interpreted as a prediction function for a response $y$; in the unsupervised setting, $g$ is an approximation of a density $f$.





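For each $\boldsymbol{x}$ we measure the discrepancy between the true model $f(\boldsymbol{x})$ and the hypothesized model $g(\boldsymbol{x} \mid \boldsymbol{\theta})$ using the loss function

$$\mathrm{Loss}(f(\boldsymbol{x}), g(\boldsymbol{x} \mid \boldsymbol{\theta})) = \ln \frac{f(\boldsymbol{x})}{g(\boldsymbol{x} \mid \boldsymbol{\theta})} = \ln f(\boldsymbol{x}) - \ln g(\boldsymbol{x} \mid \boldsymbol{\theta}).$$

The expected value of this loss (that is, the risk) is thus

$$\ell(g) = \mathbb{E} \ln \frac{f(\boldsymbol{X})}{g(\boldsymbol{X} \mid \boldsymbol{\theta})} = \int f(\boldsymbol{x}) \ln \frac{f(\boldsymbol{x})}{g(\boldsymbol{x} \mid \boldsymbol{\theta})} \, \mathrm{d}\boldsymbol{x}. \tag{2.27}$$

The integral in (2.27) provides a fundamental way to measure the distance between two densities and is called the Kullback-Leibler (KL) divergence² between $f$ and $g(\cdot \mid \boldsymbol{\theta})$. Note that the KL divergence is not symmetric in $f$ and $g(\cdot \mid \boldsymbol{\theta})$. Moreover, it is always greater than or equal to 0 (see Exercise 15) and equal to 0 when $f = g(\cdot \mid \boldsymbol{\theta})$.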
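The properties just listed (non-negativity, asymmetry, and the interpretation as an expected loss) can be checked numerically. The following sketch assumes, purely for illustration, that both densities are exponential with rates `a` and `b`; in that case the KL divergence has the closed form ln(a/b) + b/a - 1. The function names are hypothetical.

```python
import math
import random

def kl_exponential(a, b):
    """KL divergence from f = Exp(rate a) to g = Exp(rate b): E_f[ln f(X) - ln g(X)]."""
    return math.log(a / b) + b / a - 1.0

def kl_monte_carlo(a, b, n=200_000, seed=0):
    """Estimate the same quantity by averaging the loss ln f(X) - ln g(X) over X ~ f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(a)
        log_f = math.log(a) - a * x
        log_g = math.log(b) - b * x
        total += log_f - log_g
    return total / n

kl_fg = kl_exponential(1.0, 2.0)   # divergence from Exp(1) to Exp(2)
kl_gf = kl_exponential(2.0, 1.0)   # arguments swapped
kl_mc = kl_monte_carlo(1.0, 2.0)   # sample average of the loss under f
print(kl_fg, kl_gf, kl_mc)
```

The two closed-form values differ, which illustrates the lack of symmetry, and the Monte Carlo average approximates the first of them because the risk is an expectation under $f$.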

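Using similar notation as for the supervised learning setting in Table 2.1, define $g^{\mathcal{G}_p}$ as the global minimizer of the risk in the class $\mathcal{G}_p$; that is, $g^{\mathcal{G}_p} = \operatorname{argmin}_{g \in \mathcal{G}_p} \ell(g)$. If we define

$$\begin{aligned}
\boldsymbol{\theta}^* &:= \operatorname*{argmin}_{\boldsymbol{\theta}} \mathbb{E}\, \mathrm{Loss}(f(\boldsymbol{X}), g(\boldsymbol{X} \mid \boldsymbol{\theta})) = \operatorname*{argmin}_{\boldsymbol{\theta}} \int \big(\ln f(\boldsymbol{x}) - \ln g(\boldsymbol{x} \mid \boldsymbol{\theta})\big) f(\boldsymbol{x}) \, \mathrm{d}\boldsymbol{x} \\
&= \operatorname*{argmax}_{\boldsymbol{\theta}} \int f(\boldsymbol{x}) \ln g(\boldsymbol{x} \mid \boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{x} = \operatorname*{argmax}_{\boldsymbol{\theta}} \mathbb{E} \ln g(\boldsymbol{X} \mid \boldsymbol{\theta}),
\end{aligned}$$

then $g^{\mathcal{G}_p} = g(\cdot \mid \boldsymbol{\theta}^*)$ and learning $g^{\mathcal{G}_p}$ is equivalent to learning (or estimating) $\boldsymbol{\theta}^*$. To learn $\boldsymbol{\theta}^*$ from a training set $\tau = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ we then minimize the training loss,

$$\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(f(\boldsymbol{x}_i), g(\boldsymbol{x}_i \mid \boldsymbol{\theta})) = -\frac{1}{n} \sum_{i=1}^n \ln g(\boldsymbol{x}_i \mid \boldsymbol{\theta}) + \frac{1}{n} \sum_{i=1}^n \ln f(\boldsymbol{x}_i),$$

giving:

$$\widehat{\boldsymbol{\theta}}_n := \operatorname*{argmax}_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^n \ln g(\boldsymbol{x}_i \mid \boldsymbol{\theta}). \tag{2.28}$$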


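As the logarithm is an increasing function, this is equivalent to

$$\widehat{\boldsymbol{\theta}}_n := \operatorname*{argmax}_{\boldsymbol{\theta}} \prod_{i=1}^n g(\boldsymbol{x}_i \mid \boldsymbol{\theta}),$$

where $\prod_{i=1}^n g(\boldsymbol{x}_i \mid \boldsymbol{\theta})$ is the likelihood of the data; that is, the joint density of the $\{\boldsymbol{X}_i\}$ evaluated at the points $\{\boldsymbol{x}_i\}$. We therefore have recovered the classical maximum likelihood estimate of $\boldsymbol{\theta}^*$.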

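When the risk $\ell(g(\cdot \mid \boldsymbol{\theta}))$ is convex in $\boldsymbol{\theta}$ over a convex set $\Theta$, we can find the maximum likelihood estimator by setting the gradient of the training loss to zero; that is, we solve

$$-\frac{1}{n} \sum_{i=1}^n \boldsymbol{S}(\boldsymbol{x}_i \mid \boldsymbol{\theta}) = \boldsymbol{0},$$

where $\boldsymbol{S}(\boldsymbol{x} \mid \boldsymbol{\theta}) := \frac{\partial \ln g(\boldsymbol{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ is the gradient of $\ln g(\boldsymbol{x} \mid \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ and is often called the score.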


■ Example 2.5 (Exponential Model) Suppose we have the training data $\tau_n = \{x_1, \ldots, x_n\}$, which is modeled as a realization of $n$ positive iid random variables: $X_1, \ldots, X_n \sim_{\text{iid}} f(x)$. We select the class of approximating functions $\mathcal{G}$ to be the parametric class $\{g : g(x \mid \theta) = \theta \exp(-x\,\theta),\ x > 0,\ \theta > 0\}$. In other words, we look for the best $g^{\mathcal{G}}$ within the family of exponential distributions with unknown parameter $\theta > 0$. The likelihood of the data is

$$\prod_{i=1}^n g(x_i \mid \theta) = \prod_{i=1}^n \theta \exp(-\theta x_i) = \exp(-\theta\, n\, \bar{x}_n + n \ln \theta)$$

and the score is $S(x \mid \theta) = -x + \theta^{-1}$. Thus, maximizing the likelihood with respect to $\theta$ is the same as maximizing $-\theta\, n\, \bar{x}_n + n \ln \theta$ or solving $-\sum_{i=1}^n S(x_i \mid \theta)/n = \bar{x}_n - \theta^{-1} = 0$. In other words, the solution to (2.28) is the maximum likelihood estimate $\widehat{\theta}_n = 1/\bar{x}_n$. ■

²Sometimes called cross-entropy distance.
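The calculation in the exponential model can be verified on simulated data. This is a minimal sketch in which the "true" rate, the sample size, and the function names are assumptions made for illustration only; it confirms that the average score vanishes at the estimate 1/(sample mean).

```python
import random

def exp_mle(xs):
    """Maximum likelihood estimate of the rate: theta_hat = 1 / sample mean."""
    return len(xs) / sum(xs)

def mean_score(xs, theta):
    """Average score (1/n) * sum_i S(x_i | theta), with S(x | theta) = -x + 1/theta."""
    return sum(-x + 1.0 / theta for x in xs) / len(xs)

rng = random.Random(1)
true_rate = 2.0                                   # hypothetical "true" parameter
xs = [rng.expovariate(true_rate) for _ in range(5000)]

theta_hat = exp_mle(xs)
print(theta_hat, mean_score(xs, theta_hat))
```

The printed average score is zero up to rounding error, and the estimate lies close to the rate used to generate the data.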


In a supervised setting, where the data is represented by a vector $\boldsymbol{x}$ of explanatory variables and a response $y$, the general model is that $(\boldsymbol{x}, y)$ is an outcome of $(\boldsymbol{X}, Y) \sim f$ for some unknown $f$. And for a training sequence $(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)$ the default model assumption is that $(\boldsymbol{X}_1, Y_1), \ldots, (\boldsymbol{X}_n, Y_n) \sim_{\text{iid}} f$. As explained in Section 2.2, the analysis primarily involves the conditional pdf $f(y \mid \boldsymbol{x})$ and in particular (when using the squared-error loss) the conditional expectation $g^*(\boldsymbol{x}) = \mathbb{E}[Y \mid \boldsymbol{X} = \boldsymbol{x}]$. The resulting representation (2.2) allows us to then write the response at $\boldsymbol{X} = \boldsymbol{x}$ as a function of the feature $\boldsymbol{x}$ plus an error term: $Y = g^*(\boldsymbol{x}) + \varepsilon(\boldsymbol{x})$.

This leads to the simplest and most important model for supervised learning, where we choose a linear class $\mathcal{G}$ of prediction or guess functions and assume that it is rich enough to contain the true $g^*$. If we further assume that, conditional on $\boldsymbol{X} = \boldsymbol{x}$, the error term $\varepsilon$ does not depend on $\boldsymbol{x}$, that is, $\mathbb{E}\,\varepsilon = 0$ and $\mathbb{V}\mathrm{ar}\,\varepsilon = \sigma^2$, then we obtain the following model.


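Definition 2.1: Linear Model

In a linear model the response $Y$ depends on a $p$-dimensional explanatory variable $\boldsymbol{x} = [x_1, \ldots, x_p]^\top$ via the linear relationship

$$Y = \boldsymbol{x}^\top \boldsymbol{\beta} + \varepsilon, \tag{2.29}$$

where $\mathbb{E}\,\varepsilon = 0$ and $\mathbb{V}\mathrm{ar}\,\varepsilon = \sigma^2$.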





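Note that (2.29) is a model for a single pair $(\boldsymbol{x}, Y)$. The model for the training set $\{(\boldsymbol{x}_i, Y_i)\}$ is simply that each $Y_i$ satisfies (2.29) (with $\boldsymbol{x} = \boldsymbol{x}_i$) and that the $\{Y_i\}$ are independent. Gathering all responses in the vector $\boldsymbol{Y} = [Y_1, \ldots, Y_n]^\top$, we can write

$$\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{2.30}$$

where $\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_n]^\top$ is a vector of iid copies of $\varepsilon$ and $\mathbf{X}$ is the so-called model matrix, with rows $\boldsymbol{x}_1^\top, \ldots, \boldsymbol{x}_n^\top$. Linear models are fundamental building blocks of statistical learning algorithms. For this reason, a large part of Chapter 5 is devoted to linear regression models.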


■ Example 2.6 (Polynomial Regression (cont.)) For our running Example 2.1, we see that the data is described by a linear model of the form (2.30), with model matrix $\mathbf{X}$ given in (2.10). ■

Before we discuss a few other models in the following sections, we would like to emphasize a number of points about modeling.


• Any model for data is likely to be wrong. For example, real data (as opposed to computer-generated data) are often assumed to come from a normal distribution, which is never exactly true. However, an important advantage of using a normal distribution is that it has many nice mathematical properties, as we will see in Section 2.7.

• Most data models depend on a number of unknown parameters, which need to be estimated from the observed data.

• Any model for real-life data needs to be checked for suitability. An important criterion is that data simulated from the model should resemble the observed data, at least for a certain choice of model parameters.


Here are some guidelines for choosing a model. Think of the data as a spreadsheet or 
data frame, as in Chapter 1, where rows represent the data units and the columns the data 
features (variables, groups). 


• First establish the type of the features (quantitative, qualitative, discrete, continuous, etc.).

• Assess whether the data can be assumed to be independent across rows or columns.

• Decide on the level of generality of the model. For example, should we use a simple model with a few unknown parameters or a more generic model that has a large number of parameters? Simple specific models are easier to fit to the data (low estimation error) than more general models, but the fit itself may not be accurate (high approximation error). The tradeoffs discussed in Section 2.4 play an important role here.

• Decide on using a classical (frequentist) or Bayesian model. Section 2.9 gives a short introduction to Bayesian learning.


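2.7 Multivariate Normal Models

A standard model for numerical observations $x_1, \ldots, x_n$ (forming, e.g., a column in a spreadsheet or data frame) is that they are the outcomes of iid normal random variables

$$X_1, \ldots, X_n \sim_{\text{iid}} \mathcal{N}(\mu, \sigma^2).$$

It is helpful to view a normally distributed random variable as a simple transformation of a standard normal random variable. To wit, if $Z$ has a standard normal distribution, then $X = \mu + \sigma Z$ has a $\mathcal{N}(\mu, \sigma^2)$ distribution. The generalization to $n$ dimensions is discussed in Appendix C.7. We summarize the main points: Let $Z_1, \ldots, Z_n \sim_{\text{iid}} \mathcal{N}(0, 1)$. The pdf of $\boldsymbol{Z} = [Z_1, \ldots, Z_n]^\top$ (that is, the joint pdf of $Z_1, \ldots, Z_n$) is given by

$$f_{\boldsymbol{Z}}(\boldsymbol{z}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, \mathrm{e}^{-\frac{1}{2} z_i^2} = (2\pi)^{-\frac{n}{2}}\, \mathrm{e}^{-\frac{1}{2} \boldsymbol{z}^\top \boldsymbol{z}}, \quad \boldsymbol{z} \in \mathbb{R}^n. \tag{2.31}$$

We write $\boldsymbol{Z} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}_n)$ and say that $\boldsymbol{Z}$ has a standard normal distribution in $\mathbb{R}^n$. Let

$$\boldsymbol{X} = \boldsymbol{\mu} + \mathbf{B}\boldsymbol{Z} \tag{2.32}$$

for some $m \times n$ matrix $\mathbf{B}$ and $m$-dimensional vector $\boldsymbol{\mu}$. Then $\boldsymbol{X}$ has expectation vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma} = \mathbf{B}\mathbf{B}^\top$; see (C.20) and (C.21). This leads to the following definition.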


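Definition 2.2: Multivariate Normal Distribution

An $m$-dimensional random vector $\boldsymbol{X}$ that can be written in the form (2.32) for some $m$-dimensional vector $\boldsymbol{\mu}$ and $m \times n$ matrix $\mathbf{B}$, with $\boldsymbol{Z} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}_n)$, is said to have a multivariate normal or multivariate Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma} = \mathbf{B}\mathbf{B}^\top$. We write $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.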





The m-dimensional density of a multivariate normal distribution has a very similar form 
to the density of the one-dimensional normal distribution and is given in the next theorem. 
We leave the proof as an exercise; see Exercise 5. 


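Theorem 2.3: Density of a Multivariate Random Vector

Let $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where the $m \times m$ covariance matrix $\boldsymbol{\Sigma}$ is invertible. Then $\boldsymbol{X}$ has pdf

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^m\, |\boldsymbol{\Sigma}|}}\, \mathrm{e}^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})}, \quad \boldsymbol{x} \in \mathbb{R}^m. \tag{2.33}$$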





Figure 2.13 shows the pdfs of two bivariate (that is, two-dimensional) normal distributions. In both cases the mean vector is $\boldsymbol{\mu} = [0, 0]^\top$ and the variances (the diagonal elements of $\boldsymbol{\Sigma}$) are 1. The correlation coefficients (or, equivalently here, the covariances) are respectively $\varrho = 0$ and $\varrho = 0.8$.
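The construction of a multivariate normal vector as an affine transformation of a standard normal vector can be sketched for the bivariate case with unit variances. The code below assumes a lower-triangular (Cholesky) choice of the matrix B, which is only one of many matrices satisfying B Bᵀ = Σ; the function names are illustrative.

```python
import math
import random

def sample_bivariate_normal(mu, rho, n, seed=0):
    """Draw n samples of X = mu + B Z with Z ~ N(0, I_2).

    B is the lower-triangular (Cholesky) factor of Sigma = [[1, rho], [rho, 1]],
    so that B B^T = Sigma.
    """
    rng = random.Random(seed)
    b21, b22 = rho, math.sqrt(1.0 - rho * rho)    # B = [[1, 0], [rho, sqrt(1 - rho^2)]]
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((mu[0] + z1, mu[1] + b21 * z1 + b22 * z2))
    return samples

def sample_correlation(samples):
    """Empirical correlation coefficient of the two coordinates."""
    n = len(samples)
    m1 = sum(x for x, _ in samples) / n
    m2 = sum(y for _, y in samples) / n
    c12 = sum((x - m1) * (y - m2) for x, y in samples) / n
    v1 = sum((x - m1) ** 2 for x, _ in samples) / n
    v2 = sum((y - m2) ** 2 for _, y in samples) / n
    return c12 / math.sqrt(v1 * v2)

xs = sample_bivariate_normal(mu=(0.0, 0.0), rho=0.8, n=20_000)
print(sample_correlation(xs))
```

With a large sample the empirical correlation is close to the value 0.8 used for the right-hand panel of the figure; setting `rho=0.0` gives the uncorrelated case.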



















































































Figure 2.13: Pdfs of bivariate normal distributions with means zero, variances 1, and correlation coefficients 0 (left) and 0.8 (right).

The main reason why the multivariate normal distribution plays an important role in data science and machine learning is that it satisfies the following properties, the details and proofs of which can be found in Appendix C.7:

1. Affine combinations are normal.
2. Marginal distributions are normal.
3. Conditional distributions are normal.


2.8 Normal Linear Models 
Normal linear models combine the simplicity of the linear model with the tractability of the Gaussian distribution. They are the principal model for traditional statistics, and include the classic linear regression and analysis of variance models.

Definition 2.3: Normal Linear Model

In a normal linear model the response $Y$ depends on a $p$-dimensional explanatory variable $\boldsymbol{x} = [x_1, \ldots, x_p]^\top$, via the linear relationship

$$Y = \boldsymbol{x}^\top \boldsymbol{\beta} + \varepsilon, \tag{2.34}$$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.





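Thus, a normal linear model is a linear model (in the sense of Definition 2.1) with normal error terms. Similar to (2.30), the corresponding normal linear model for the whole training set $\{(\boldsymbol{x}_i, Y_i)\}$ has the form

$$\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{2.35}$$

where $\mathbf{X}$ is the model matrix comprised of rows $\boldsymbol{x}_1^\top, \ldots, \boldsymbol{x}_n^\top$ and $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \mathbf{I}_n)$. Consequently, $\boldsymbol{Y}$ can be written as $\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \sigma \boldsymbol{Z}$, where $\boldsymbol{Z} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}_n)$, so that $\boldsymbol{Y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$. It follows from (2.33) that its joint density is given by

$$g(\boldsymbol{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}) = (2\pi\sigma^2)^{-\frac{n}{2}}\, \mathrm{e}^{-\frac{1}{2\sigma^2} \|\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}\|^2}. \tag{2.36}$$

Estimation of the parameter $\boldsymbol{\beta}$ can be performed via the least-squares method, as discussed in Example 2.1. An estimate can also be obtained via the maximum likelihood method. This simply means finding the parameters $\sigma^2$ and $\boldsymbol{\beta}$ that maximize the likelihood of the outcome $\boldsymbol{y}$, given by the right-hand side of (2.36). It is clear that for every value of $\sigma^2$ the likelihood is maximal when $\|\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ is minimal. As a consequence, the maximum likelihood estimate for $\boldsymbol{\beta}$ is the same as the least-squares estimate (2.15). We leave it as an exercise (see Exercise 18) to show that the maximum likelihood estimate of $\sigma^2$ is equal to

$$\widehat{\sigma^2} = \frac{\|\boldsymbol{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2}{n}, \tag{2.37}$$

where $\widehat{\boldsymbol{\beta}}$ is the maximum likelihood estimate (least squares estimate in this case) of $\boldsymbol{\beta}$.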
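The maximum likelihood estimates for the normal linear model can be illustrated in the simplest non-trivial case. The sketch below assumes a hypothetical model matrix with rows [1, u_i] (an intercept and one feature) and simulated responses; the 2×2 normal equations are solved explicitly, and the variance estimate is the residual sum of squares divided by n, as in (2.37). All names and the simulated "truth" are illustrative.

```python
import math
import random

def fit_normal_linear_model(us, ys):
    """MLE for Y_i = beta0 + beta1 * u_i + eps_i with eps_i ~ N(0, sigma^2).

    The model matrix has rows [1, u_i]; beta_hat solves the normal equations
    X^T X beta = X^T y, and sigma2_hat = ||y - X beta_hat||^2 / n as in (2.37).
    """
    n = len(us)
    su, sy = sum(us), sum(ys)
    suu = sum(u * u for u in us)
    suy = sum(u * y for u, y in zip(us, ys))
    det = n * suu - su * su
    beta0 = (suu * sy - su * suy) / det
    beta1 = (n * suy - su * sy) / det
    rss = sum((y - beta0 - beta1 * u) ** 2 for u, y in zip(us, ys))
    return (beta0, beta1), rss / n

def log_likelihood(beta, sigma2, us, ys):
    """Logarithm of the joint density (2.36)."""
    n = len(us)
    rss = sum((y - beta[0] - beta[1] * u) ** 2 for u, y in zip(us, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

rng = random.Random(2)
us = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [1.0 + 3.0 * u + rng.gauss(0.0, 0.5) for u in us]   # hypothetical truth

beta_hat, sigma2_hat = fit_normal_linear_model(us, ys)
print(beta_hat, sigma2_hat, log_likelihood(beta_hat, sigma2_hat, us, ys))
```

Evaluating `log_likelihood` at perturbed parameter values gives smaller numbers than at the fitted ones, which is a quick sanity check that the closed-form estimates do maximize (2.36).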
2.9 Bayesian Learning


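In Bayesian unsupervised learning, we seek to approximate the unknown joint density $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ of the training data $\mathcal{T}_n = \{\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n\}$ via a joint pdf of the form

$$\int \left( \prod_{i=1}^n g(\boldsymbol{x}_i \mid \boldsymbol{\theta}) \right) w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}, \tag{2.38}$$

where $g(\cdot \mid \boldsymbol{\theta})$ belongs to a family of parametric densities $\mathcal{G}_p := \{g(\cdot \mid \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta\}$ (viewed as a family of pdfs conditional on a parameter $\boldsymbol{\theta}$ in some set $\Theta \subset \mathbb{R}^p$) and $w(\boldsymbol{\theta})$ is a pdf that belongs to a (possibly different) family of densities $\mathcal{W}_p$. Note how the joint pdf (2.38) satisfies the permutational invariance (2.26) and can thus be useful as a model for training data which is part of an exchangeable sequence of random variables.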


Following standard practice in a Bayesian context, instead of writing $f_X(x)$ and $f_{X \mid Y}(x \mid y)$ for the pdf of $X$ and the conditional pdf of $X$ given $Y$, one simply writes $f(x)$ and $f(x \mid y)$. If $Y$ is a different random variable, its pdf (at $y$) is thus denoted by $f(y)$.

Thus, we will use the same symbol $g$ for different (conditional) approximating probability densities and $f$ for the different (conditional) true and unknown probability densities. Using Bayesian notation, we can write $g(\tau \mid \boldsymbol{\theta}) = \prod_{i=1}^n g(\boldsymbol{x}_i \mid \boldsymbol{\theta})$ and thus the approximating joint pdf (2.38) can then be written as $\int g(\tau \mid \boldsymbol{\theta})\, w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$ and the true unknown joint pdf as $f(\tau) = f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.


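Once $\mathcal{G}_p$ and $\mathcal{W}_p$ are specified, selecting an approximating function $g(\boldsymbol{x})$ of the form

$$g(\boldsymbol{x}) = \int g(\boldsymbol{x} \mid \boldsymbol{\theta})\, w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$$

is equivalent to selecting a suitable $w$ from $\mathcal{W}_p$. Similar to (2.27), we can use the Kullback-Leibler risk to measure the discrepancy between the proposed approximation (2.38) and the true $f(\tau)$:

$$\ell(g) = \mathbb{E} \ln \frac{f(\mathcal{T})}{\int g(\mathcal{T} \mid \boldsymbol{\theta})\, w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}} = \int f(\tau) \ln \frac{f(\tau)}{\int g(\tau \mid \boldsymbol{\theta})\, w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}} \, \mathrm{d}\tau. \tag{2.39}$$

The main difference with (2.27) is that since the training data is not necessarily iid (it may be exchangeable, for example), the expectation must be with respect to the joint density of $\mathcal{T}$, not with respect to the marginal $f(\boldsymbol{x})$ (as in the iid case).

Minimizing the training loss is equivalent to maximizing the likelihood of the training data $\tau$; that is, solving the optimization problem

$$\max_{w \in \mathcal{W}_p} \int g(\tau \mid \boldsymbol{\theta})\, w(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta},$$

where the maximization is over an appropriate class $\mathcal{W}_p$ of density functions that is believed to result in the smallest KL risk.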







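Suppose that we have a rough guess, denoted $w_0(\boldsymbol{\theta})$, for the best $w \in \mathcal{W}_p$ that minimizes the Kullback-Leibler risk. We can always increase the resulting likelihood $L_0 := \int g(\tau \mid \boldsymbol{\theta})\, w_0(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$ by instead using the density $w_1(\boldsymbol{\theta}) := w_0(\boldsymbol{\theta})\, g(\tau \mid \boldsymbol{\theta}) / L_0$, giving a likelihood $L_1 := \int g(\tau \mid \boldsymbol{\theta})\, w_1(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$. To see this, write $L_0$ and $L_1$ as expectations with respect to $w_0$. In particular, we can write

$$L_0 = \mathbb{E}_{w_0}\, g(\tau \mid \boldsymbol{\theta}) \quad \text{and} \quad L_1 = \mathbb{E}_{w_1}\, g(\tau \mid \boldsymbol{\theta}) = \mathbb{E}_{w_0}\, g^2(\tau \mid \boldsymbol{\theta}) / L_0.$$

It follows that

$$L_1 - L_0 = \frac{1}{L_0}\, \mathbb{E}_{w_0} \left[ g^2(\tau \mid \boldsymbol{\theta}) - L_0^2 \right] = \frac{1}{L_0}\, \mathbb{V}\mathrm{ar}_{w_0} [g(\tau \mid \boldsymbol{\theta})] \geq 0. \tag{2.40}$$

We may thus expect to obtain better predictions using $w_1$ instead of $w_0$, because $w_1$ has taken into account the observed data $\tau$ and increased the likelihood of the model. In fact, if we iterate this process (see Exercise 20) and create a sequence of densities $w_1, w_2, \ldots$ such that $w_t(\boldsymbol{\theta}) \propto w_{t-1}(\boldsymbol{\theta})\, g(\tau \mid \boldsymbol{\theta})$, then $w_t(\boldsymbol{\theta})$ concentrates more and more of its probability mass at the maximum likelihood estimator $\widehat{\boldsymbol{\theta}}$ (see (2.28)) and in the limit equals a (degenerate) point-mass pdf at $\widehat{\boldsymbol{\theta}}$. In other words, in the limit we recover the maximum likelihood method: $g_\tau(\boldsymbol{x}) = g(\boldsymbol{x} \mid \widehat{\boldsymbol{\theta}})$. Thus, unless the class of densities $\mathcal{W}_p$ is restricted to be non-degenerate, maximizing the likelihood as much as possible leads to a degenerate choice for $w(\boldsymbol{\theta})$.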
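The iteration described above can be watched at work on a discretized parameter space. The following sketch assumes a hypothetical Bernoulli likelihood with 3 successes in 10 trials and represents each density by probability masses on a grid of 200 points; none of these choices come from the text. It records the likelihood at each step and reports where the reweighted density ends up.

```python
def bernoulli_likelihood(theta, s, n):
    """g(tau | theta) for n Bernoulli trials with s successes."""
    return theta ** s * (1.0 - theta) ** (n - s)

# Discretize theta on a grid; a pdf w is represented by its probability masses.
grid = [(k + 0.5) / 200 for k in range(200)]
s, n = 3, 10                                   # hypothetical data: 3 successes in 10 trials
lik = [bernoulli_likelihood(th, s, n) for th in grid]

w = [1.0 / len(grid)] * len(grid)              # rough initial guess w_0: uniform
likelihoods = []
for t in range(50):
    L = sum(l * wk for l, wk in zip(lik, w))   # L_t = E_{w_t} g(tau | theta)
    likelihoods.append(L)
    w = [wk * l / L for wk, l in zip(w, lik)]  # w_{t+1} proportional to w_t * g(tau | theta)

mode = grid[max(range(len(grid)), key=lambda k: w[k])]
near_mle = sum(wk for th, wk in zip(grid, w) if abs(th - s / n) < 0.1)
print(likelihoods[0], likelihoods[-1], mode, near_mle)
```

The recorded likelihoods never decrease, in line with (2.40), and after many reweighting steps almost all of the mass sits near the maximum likelihood estimate s/n.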

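In many situations, the maximum likelihood estimate $g(\tau \mid \widehat{\boldsymbol{\theta}})$ is either not an appropriate approximation to $f(\tau)$ (see Example 2.9), or simply fails to exist (see Exercise 10 in Chapter 4). In such cases, given an initial non-degenerate guess $w_0(\boldsymbol{\theta}) = g(\boldsymbol{\theta})$, one can obtain a more appropriate and non-degenerate approximation to $f(\tau)$ by taking $w(\boldsymbol{\theta}) = w_1(\boldsymbol{\theta}) \propto g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta})$ in (2.38), giving the following Bayesian learner of $f(\boldsymbol{x})$:

$$g_\tau(\boldsymbol{x}) := \int g(\boldsymbol{x} \mid \boldsymbol{\theta})\, \frac{g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta})}{\int g(\tau \mid \boldsymbol{\vartheta})\, g(\boldsymbol{\vartheta}) \, \mathrm{d}\boldsymbol{\vartheta}} \, \mathrm{d}\boldsymbol{\theta}, \tag{2.41}$$

where $\int g(\tau \mid \boldsymbol{\vartheta})\, g(\boldsymbol{\vartheta}) \, \mathrm{d}\boldsymbol{\vartheta} = g(\tau)$. Using Bayes' formula for probability densities,

$$g(\boldsymbol{\theta} \mid \tau) = \frac{g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta})}{g(\tau)}, \tag{2.42}$$

we can write $w_1(\boldsymbol{\theta}) = g(\boldsymbol{\theta} \mid \tau)$. With this notation, we have the following definitions.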


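Definition 2.4: Prior, Likelihood, and Posterior

Let $\tau$ and $\mathcal{G}_p := \{g(\cdot \mid \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta\}$ be the training set and family of approximating functions.

• A pdf $g(\boldsymbol{\theta})$ that reflects our a priori beliefs about $\boldsymbol{\theta}$ is called the prior pdf.

• The conditional pdf $g(\tau \mid \boldsymbol{\theta})$ is called the likelihood.

• Inference about $\boldsymbol{\theta}$ is given by the posterior pdf $g(\boldsymbol{\theta} \mid \tau)$, which is proportional to the product of the prior and the likelihood:

$$g(\boldsymbol{\theta} \mid \tau) \propto g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}).$$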





■ Remark 2.1 (Early Stopping) Bayes iteration is an example of an "early stopping" heuristic for maximum likelihood optimization, where we exit after only one step. As observed above, if we keep iterating, we obtain the maximum likelihood estimate (MLE). In a sense the Bayes rule provides a regularization of the MLE. Regularization is discussed in more detail in Chapter 6; see also Example 2.9. The early stopping rule is also of benefit in regularization; see Exercise 20 in Chapter 6. ■


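On the one hand, the initial guess $g(\boldsymbol{\theta})$ conveys the a priori (prior to training the Bayesian learner) information about the optimal density in $\mathcal{W}_p$ that minimizes the KL risk. Using this prior $g(\boldsymbol{\theta})$, the Bayesian approximation to $f(\boldsymbol{x})$ is the prior predictive density:

$$g(\boldsymbol{x}) = \int g(\boldsymbol{x} \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}.$$

On the other hand, the posterior pdf conveys improved knowledge about this optimal density in $\mathcal{W}_p$ after training with $\tau$. Using the posterior $g(\boldsymbol{\theta} \mid \tau)$, the Bayesian learner of $f(\boldsymbol{x})$ is the posterior predictive density:

$$g_\tau(\boldsymbol{x}) = g(\boldsymbol{x} \mid \tau) = \int g(\boldsymbol{x} \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta} \mid \tau) \, \mathrm{d}\boldsymbol{\theta},$$

where we have assumed that $g(\boldsymbol{x} \mid \boldsymbol{\theta}, \tau) = g(\boldsymbol{x} \mid \boldsymbol{\theta})$; that is, the likelihood depends on $\tau$ only through the parameter $\boldsymbol{\theta}$.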
The choice of the prior is typically governed by two considerations: 


1. the prior should be simple enough to facilitate the computation or simulation of the 
posterior pdf; 


2. the prior should be general enough to model ignorance of the parameter of interest. 


Priors that do not convey much knowledge of the parameter are said to be uninformat- 
ive. The uniform or flat prior in Example 2.9 (to follow) is frequently used. 


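For the purpose of analytical and numerical computations, we can view $\boldsymbol{\theta}$ as a random vector with prior density $g(\boldsymbol{\theta})$, which after training is updated to the posterior density $g(\boldsymbol{\theta} \mid \tau)$.

The above thinking allows us to write $g(\boldsymbol{x} \mid \tau) \propto \int g(\boldsymbol{x} \mid \boldsymbol{\theta})\, g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$, for example, thus ignoring any constants that do not depend on the argument of the densities.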


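■ Example 2.7 (Normal Model) Suppose that the training data $\mathcal{T} = \{X_1, \ldots, X_n\}$ is modeled using the likelihood $g(x \mid \boldsymbol{\theta})$ that is the pdf of

$$X \mid \boldsymbol{\theta} \sim \mathcal{N}(\mu, \sigma^2),$$

where $\boldsymbol{\theta} := [\mu, \sigma^2]^\top$. Next, we need to specify the prior distribution of $\boldsymbol{\theta}$ to complete the model. We can specify prior distributions for $\mu$ and $\sigma^2$ separately and then take their product to obtain the prior for vector $\boldsymbol{\theta}$ (assuming independence). A possible prior distribution for $\mu$ is

$$\mu \sim \mathcal{N}(\nu, \phi^2). \tag{2.43}$$

It is typical to refer to any parameters of the prior density as hyperparameters of the Bayesian model. Instead of giving directly a prior for $\sigma^2$ (or $\sigma$), it turns out to be convenient to give the following prior distribution to $1/\sigma^2$:

$$\frac{1}{\sigma^2} \sim \mathsf{Gamma}(\alpha, \beta). \tag{2.44}$$

The smaller $\alpha$ and $\beta$ are, the less informative is the prior. Under this prior, $\sigma^2$ is said to have an inverse gamma³ distribution. If $1/Z \sim \mathsf{Gamma}(\alpha, \beta)$, then the pdf of $Z$ is proportional to $\exp(-\beta/z)/z^{\alpha+1}$ (Exercise 19). The Bayesian posterior is then given by:

$$\begin{aligned}
g(\mu, \sigma^2 \mid \tau) &\propto g(\mu) \times g(\sigma^2) \times g(\tau \mid \mu, \sigma^2) \\
&\propto \exp\left\{ -\frac{(\mu - \nu)^2}{2\phi^2} \right\} \times \frac{\exp\{-\beta/\sigma^2\}}{(\sigma^2)^{\alpha+1}} \times \frac{\exp\left\{ -\sum_i (x_i - \mu)^2 / (2\sigma^2) \right\}}{(\sigma^2)^{n/2}} \\
&\propto (\sigma^2)^{-n/2 - \alpha - 1} \exp\left\{ -\frac{(\mu - \nu)^2}{2\phi^2} - \frac{\beta}{\sigma^2} - \frac{(\mu - \bar{x}_n)^2 + S_n^2}{2\sigma^2/n} \right\},
\end{aligned}$$

where $S_n^2 := \frac{1}{n} \sum_i x_i^2 - \bar{x}_n^2 = \frac{1}{n} \sum_i (x_i - \bar{x}_n)^2$ is the (scaled) sample variance. All inference about $(\mu, \sigma^2)$ is then represented by the posterior pdf. To facilitate computations it is helpful to find out if the posterior belongs to a recognizable family of distributions. For example, the conditional pdf of $\mu$ given $\sigma^2$ and $\tau$ is

$$g(\mu \mid \sigma^2, \tau) \propto \exp\left\{ -\frac{(\mu - \nu)^2}{2\phi^2} - \frac{(\mu - \bar{x}_n)^2}{2\sigma^2/n} \right\},$$

which after simplification can be recognized as the pdf of

$$(\mu \mid \sigma^2, \tau) \sim \mathcal{N}\big( \gamma_n \bar{x}_n + (1 - \gamma_n)\nu,\ \gamma_n \sigma^2/n \big), \tag{2.45}$$

where we have defined the weight parameter: $\gamma_n := \frac{n}{\sigma^2} \Big/ \left( \frac{1}{\phi^2} + \frac{n}{\sigma^2} \right)$. We can then see that the posterior mean $\mathbb{E}[\mu \mid \sigma^2, \tau] = \gamma_n \bar{x}_n + (1 - \gamma_n)\nu$ is a weighted linear combination of the prior mean $\nu$ and the sample average $\bar{x}_n$. Further, as $n \to \infty$, the weight $\gamma_n \to 1$ and thus the posterior mean approaches the maximum likelihood estimate $\bar{x}_n$. ■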
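The weighted-average form of the conditional posterior mean is easy to evaluate directly. The sketch below assumes a handful of hypothetical observations and hyperparameter values (`nu`, `phi2`, and a known `sigma2`), none of which come from the text, and computes the weight on the sample mean together with the resulting posterior mean and variance.

```python
def posterior_mu_given_sigma2(xs, sigma2, nu, phi2):
    """Mean and variance of (mu | sigma^2, tau) from (2.45).

    Prior: mu ~ N(nu, phi2); likelihood: x_i | mu ~ N(mu, sigma2), independently.
    """
    n = len(xs)
    xbar = sum(xs) / n
    gamma_n = (n / sigma2) / (1.0 / phi2 + n / sigma2)   # weight on the sample mean
    mean = gamma_n * xbar + (1.0 - gamma_n) * nu
    var = gamma_n * sigma2 / n
    return mean, var, gamma_n

xs = [2.1, 1.7, 2.5, 1.9, 2.3]            # hypothetical observations
mean, var, gamma_n = posterior_mu_given_sigma2(xs, sigma2=1.0, nu=0.0, phi2=4.0)
print(mean, var, gamma_n)

# A much vaguer prior (large phi2) pushes the weight toward 1.
_, _, gamma_vague = posterior_mu_given_sigma2(xs, sigma2=1.0, nu=0.0, phi2=1e8)
print(gamma_vague)
```

The posterior mean is shrunk from the sample mean toward the prior mean, and the shrinkage disappears as the prior variance grows, which anticipates the flat-prior limit.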


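It is sometimes possible to use a prior $g(\boldsymbol{\theta})$ that is not a bona fide probability density, in the sense that $\int g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta} = \infty$, as long as the resulting posterior $g(\boldsymbol{\theta} \mid \tau) \propto g(\tau \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta})$ is a proper pdf. Such a prior is called an improper prior.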


■ Example 2.8 (Normal Model (cont.)) An example of an improper prior is obtained from (2.43) when we let $\phi \to \infty$ (the larger $\phi$ is, the more uninformative is the prior). Then, $g(\mu) \propto 1$ is a flat prior, but $\int g(\mu) \, \mathrm{d}\mu = \infty$, making it an improper prior. Nevertheless, the posterior is a proper density, and in particular the conditional posterior of $(\mu \mid \sigma^2, \tau)$ simplifies to

$$(\mu \mid \sigma^2, \tau) \sim \mathcal{N}(\bar{x}_n,\ \sigma^2/n),$$

because the weight parameter $\gamma_n$ goes to 1 as $\phi \to \infty$. The improper prior $g(\mu) \propto 1$ also allows us to simplify the posterior marginal for $\sigma^2$:

$$g(\sigma^2 \mid \tau) = \int g(\mu, \sigma^2 \mid \tau) \, \mathrm{d}\mu \propto (\sigma^2)^{-(n-1)/2 - \alpha - 1} \exp\left\{ -\frac{\beta + n S_n^2/2}{\sigma^2} \right\},$$

which we recognize as the density corresponding to

$$\frac{1}{\sigma^2} \,\Big|\, \tau \sim \mathsf{Gamma}\left( \alpha + \frac{n-1}{2},\ \beta + \frac{n}{2} S_n^2 \right).$$

In addition to $g(\mu) \propto 1$, we can also use an improper prior for $\sigma^2$. If we take the limit $\alpha \to 0$ and $\beta \to 0$ in (2.44), then we also obtain the improper prior $g(\sigma^2) \propto 1/\sigma^2$ (or equivalently, as a density for the precision $1/\sigma^2$, $g(1/\sigma^2) \propto \sigma^2$). In this case, the posterior marginal density for $\sigma^2$ implies that:

$$\frac{n S_n^2}{\sigma^2} \,\Big|\, \tau \sim \chi^2_{n-1}$$

and the posterior marginal density for $\mu$ implies that:

$$\frac{\mu - \bar{x}_n}{S_n / \sqrt{n-1}} \,\Big|\, \tau \sim \mathsf{t}_{n-1}. \tag{2.46}$$

In general, deriving a simple formula for the posterior density of $\boldsymbol{\theta}$ is either impossible or too tedious. Instead, the Monte Carlo methods in Chapter 3 can be used to simulate (approximately) from the posterior for the purposes of inference and prediction. ■

³Reciprocal gamma distribution would have been a better name.


One way in which a distributional result such as (2.46) can be useful is in the construction of a 95% credible interval $\mathcal{I}$ for the parameter $\mu$; that is, an interval $\mathcal{I}$ such that the probability $\mathbb{P}[\mu \in \mathcal{I} \mid \tau]$ is equal to 0.95. For example, the symmetric 95% credible interval is

$$\mathcal{I} = \left[ \bar{x}_n - \gamma \frac{S_n}{\sqrt{n-1}},\ \ \bar{x}_n + \gamma \frac{S_n}{\sqrt{n-1}} \right],$$

where $\gamma$ is the 0.975-quantile of the $\mathsf{t}_{n-1}$ distribution. Note that the credible interval is not a random object and that the parameter $\mu$ is interpreted as a random variable with a distribution. This is unlike the case of classical confidence intervals, where the parameter is nonrandom, but the interval is (the outcome of) a random object.

As a generalization of the 95% Bayesian credible interval we can define a $1 - \alpha$ credible region, which is any set $\mathcal{R}$ satisfying

$$\mathbb{P}[\boldsymbol{\theta} \in \mathcal{R} \mid \tau] = \int_{\boldsymbol{\theta} \in \mathcal{R}} g(\boldsymbol{\theta} \mid \tau) \, \mathrm{d}\boldsymbol{\theta} \geq 1 - \alpha. \tag{2.47}$$



■ Example 2.9 (Bayesian Regularization of Maximum Likelihood) Consider modeling the number of deaths during birth in a maternity ward. Suppose that the hospital data consists of $\tau = \{x_1, \ldots, x_n\}$, with $x_i = 1$ if the $i$-th baby has died during birth and $x_i = 0$ otherwise, for $i = 1, \ldots, n$. A possible Bayesian model for the data is $\theta \sim \mathcal{U}(0, 1)$ (uniform prior) with $(X_1, \ldots, X_n \mid \theta) \sim_{\text{iid}} \mathsf{Ber}(\theta)$. The likelihood is therefore

$$g(\tau \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^s (1 - \theta)^{n - s},$$

where $s = x_1 + \cdots + x_n$ is the total number of deaths. Since $g(\theta) = 1$, the posterior pdf is

$$g(\theta \mid \tau) \propto \theta^s (1 - \theta)^{n - s}, \quad \theta \in [0, 1],$$

which is the pdf of the $\mathsf{Beta}(s + 1, n - s + 1)$ distribution. The normalization constant is $(n + 1)\binom{n}{s}$. The posterior pdf is shown in Figure 2.14 for $(s, n) = (0, 100)$.

Figure 2.14: Posterior pdf for $\theta$, with $n = 100$ and $s = 0$.

It is not difficult to see that the maximum a posteriori (MAP) estimate of $\theta$ (the mode or maximizer of the posterior density) is

$$\operatorname*{argmax}_{\theta}\, g(\theta \mid \tau) = \frac{s}{n},$$

which agrees with the maximum likelihood estimate. Figure 2.14 also shows that the left one-sided 95% credible interval for $\theta$ is $[0, 0.0292]$, where 0.0292 is the 0.95 quantile (rounded) of the $\mathsf{Beta}(1, 101)$ distribution.

Observe that when $(s, n) = (0, 100)$ the maximum likelihood estimate $\widehat{\theta} = 0$ infers that deaths at birth are not possible. We know that this inference is wrong: the probability of death can never be zero; it is simply (and fortunately) too small to be inferred accurately from a sample size of $n = 100$. In contrast to the maximum likelihood estimate, the posterior mean $\mathbb{E}[\theta \mid \tau] = (s + 1)/(n + 2)$ is not zero for $(s, n) = (0, 100)$ and provides the more reasonable point estimate of 0.0098 for the probability of death.

In addition, while computing a Bayesian credible interval poses no conceptual difficulties, it is not simple to derive a confidence interval for the maximum likelihood estimate of $\theta$, because the likelihood as a function of $\theta$ is not differentiable at $\theta = 0$. As a result of this lack of smoothness, the usual confidence intervals based on the normal approximation cannot be used. ■
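The numbers quoted in this example can be reproduced with a few lines of arithmetic. The sketch below relies on one extra fact that is not stated in the text: a Beta(1, b) distribution has cdf 1 - (1 - θ)^b on [0, 1], so its quantiles are available in closed form when s = 0. The function names are illustrative.

```python
def beta_posterior_summary(s, n):
    """Summaries of the Beta(s + 1, n - s + 1) posterior under a U(0, 1) prior."""
    post_mean = (s + 1) / (n + 2)
    map_estimate = s / n                  # mode of the posterior; equals the MLE
    return post_mean, map_estimate

def beta_1_b_quantile(q, b):
    """Quantile of Beta(1, b), whose cdf is 1 - (1 - theta)^b on [0, 1]."""
    return 1.0 - (1.0 - q) ** (1.0 / b)

s, n = 0, 100
post_mean, map_estimate = beta_posterior_summary(s, n)
upper = beta_1_b_quantile(0.95, n - s + 1)   # valid here because s = 0 gives Beta(1, 101)
print(round(post_mean, 4), map_estimate, round(upper, 4))
```

The rounded outputs agree with the posterior mean 0.0098, the MAP estimate 0, and the upper endpoint 0.0292 of the one-sided credible interval given above. For s > 0 the quantile has no such closed form and would need a numerical routine.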


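We now return to the unsupervised learning setting of Section 2.6, but consider this from a Bayesian perspective. Recall from (2.39) that the Kullback-Leibler risk for an approximating function $g$ is

$$\ell(g) = \int f(\tau_n') \left[ \ln f(\tau_n') - \ln g(\tau_n') \right] \mathrm{d}\tau_n',$$

where $\tau_n'$ denotes the test data. Since $\int f(\tau_n') \ln f(\tau_n') \, \mathrm{d}\tau_n'$ plays no role in minimizing the risk, we consider instead the cross-entropy risk, defined as

$$\ell(g) = -\int f(\tau_n') \ln g(\tau_n') \, \mathrm{d}\tau_n'.$$

Note that the smallest possible cross-entropy risk is $\ell_n^* = -\int f(\tau_n') \ln f(\tau_n') \, \mathrm{d}\tau_n'$. The expected generalization risk of the Bayesian learner can then be decomposed as

$$\mathbb{E}\, \ell(g_{\mathcal{T}_n}) = \ell_n^* + \underbrace{\int f(\tau_n') \ln \frac{f(\tau_n')}{\mathbb{E}\, g(\tau_n' \mid \mathcal{T}_n)} \, \mathrm{d}\tau_n'}_{\text{"bias" component}} + \underbrace{\mathbb{E} \int f(\tau_n') \ln \frac{\mathbb{E}\, g(\tau_n' \mid \mathcal{T}_n)}{g(\tau_n' \mid \mathcal{T}_n)} \, \mathrm{d}\tau_n'}_{\text{"variance" component}},$$

where $g_{\mathcal{T}_n}(\tau_n') = g(\tau_n' \mid \mathcal{T}_n) = \int g(\tau_n' \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta} \mid \mathcal{T}_n) \, \mathrm{d}\boldsymbol{\theta}$ is the posterior predictive density after observing $\mathcal{T}_n$.

Assuming that the sets $\mathcal{T}_n$ and $\mathcal{T}_n'$ are comprised of $2n$ iid random variables with density $f$, we can show (Exercise 23) that the expected generalization risk simplifies to

$$\mathbb{E}\, \ell(g_{\mathcal{T}_n}) = \mathbb{E} \ln g(\mathcal{T}_n) - \mathbb{E} \ln g(\mathcal{T}_{2n}), \tag{2.48}$$

where $g(\tau_n)$ and $g(\tau_{2n})$ are the prior predictive densities of $\tau_n$ and $\tau_{2n}$, respectively.

Let $\boldsymbol{\theta}_n = \operatorname{argmax}_{\boldsymbol{\theta}}\, g(\boldsymbol{\theta} \mid \mathcal{T}_n)$ be the MAP estimator of $\boldsymbol{\theta}^* := \operatorname{argmax}_{\boldsymbol{\theta}}\, \mathbb{E} \ln g(\boldsymbol{X} \mid \boldsymbol{\theta})$. Assuming that $\boldsymbol{\theta}_n$ converges to $\boldsymbol{\theta}^*$ (with probability one) and $\frac{1}{n} \mathbb{E} \ln g(\mathcal{T}_n \mid \boldsymbol{\theta}_n) = \mathbb{E} \ln g(\boldsymbol{X} \mid \boldsymbol{\theta}^*) + \mathcal{O}(1/n)$, we can use the following large-sample approximation of the expected generalization risk.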


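Theorem 2.4: Approximating the Bayesian Cross-Entropy Risk

For $n \to \infty$, the expected cross-entropy generalization risk satisfies

$$\mathbb{E}\, \ell(g_{\mathcal{T}_n}) \simeq -\mathbb{E} \ln g(\mathcal{T}_n) - \frac{p}{2} \ln n, \tag{2.49}$$

where (with $p$ the dimension of the parameter vector $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_n$ the MAP estimator)

$$\mathbb{E} \ln g(\mathcal{T}_n) \simeq \mathbb{E} \ln g(\mathcal{T}_n \mid \boldsymbol{\theta}_n) - \frac{p}{2} \ln n. \tag{2.50}$$

Proof: To show (2.50), we apply Theorem C.21 to $\ln \int \mathrm{e}^{-n r_n(\boldsymbol{\theta})} g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$, where

$$r_n(\boldsymbol{\theta}) := -\frac{1}{n} \ln g(\mathcal{T}_n \mid \boldsymbol{\theta}) = -\frac{1}{n} \sum_{i=1}^n \ln g(\boldsymbol{X}_i \mid \boldsymbol{\theta}) \ \xrightarrow{\text{a.s.}}\ -\mathbb{E} \ln g(\boldsymbol{X} \mid \boldsymbol{\theta}) =: r(\boldsymbol{\theta}) < \infty.$$

This gives (with probability one)

$$\ln \int g(\mathcal{T}_n \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta} \simeq -n\, r(\boldsymbol{\theta}^*) - \frac{p}{2} \ln(n).$$

Taking expectations on both sides and using $n\, r(\boldsymbol{\theta}^*) = n\, \mathbb{E}[r_n(\boldsymbol{\theta}_n)] + \mathcal{O}(1)$, we deduce (2.50). To demonstrate (2.49), we derive the asymptotic approximation of $\mathbb{E} \ln g(\mathcal{T}_{2n})$ by repeating the argument for (2.50), but replacing $n$ with $2n$, where necessary. Thus, we obtain:

$$\mathbb{E} \ln g(\mathcal{T}_{2n}) \simeq -2n\, r(\boldsymbol{\theta}^*) - \frac{p}{2} \ln(2n).$$

Then, (2.49) follows from the identity (2.48). □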


The results of Theorem 2.4 have two major implications for model selection and assessment. First, (2.49) suggests that $-\ln g(\tau_n)$ can be used as a crude (leading-order) asymptotic approximation to the expected generalization risk for large $n$ and fixed $p$. In this context, the prior predictive density $g(\tau_n)$ is usually called the model evidence or marginal likelihood for the class $\mathcal{G}_p$. Since the integral $\int g(\tau_n \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta}$ is rarely available in closed form, the exact computation of the model evidence is typically not feasible and may require Monte Carlo estimation methods.

Second, when the model evidence is difficult to compute via Monte Carlo methods or otherwise, (2.50) suggests that we can use the following large-sample approximation:

$$-2\, \mathbb{E} \ln g(\mathcal{T}_n) \simeq -2 \ln g(\mathcal{T}_n \mid \boldsymbol{\theta}_n) + p \ln(n). \tag{2.51}$$

The asymptotic approximation on the right-hand side of (2.51) is called the Bayesian information criterion (BIC). We prefer the class $\mathcal{G}_p$ with the smallest BIC. The BIC is typically used when the model evidence is difficult to compute and $n$ is sufficiently larger than $p$. For a fixed $p$, and as $n$ becomes larger and larger, the BIC becomes a more and more accurate estimator of $-2\, \mathbb{E} \ln g(\mathcal{T}_n)$. Note that the BIC approximation is valid even when the true density $f \notin \mathcal{G}_p$. The BIC provides an alternative to the Akaike information criterion (AIC) for model selection. However, while the BIC approximation does not assume that the true model $f$ belongs to the parametric class under consideration, the AIC assumes that $f \in \mathcal{G}_p$. Thus, the AIC is merely a heuristic approximation based on the asymptotic approximations in Theorem 4.1.
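As a small illustration of how the BIC is used to compare parametric classes, the sketch below fits two hypothetical normal models to simulated data: one with the mean fixed at zero (p = 1) and one with a free mean (p = 2). It plugs the maximum likelihood estimates into the right-hand side of (2.51) in place of the MAP estimator, which is an assumption made here for simplicity; the data and function names are likewise illustrative.

```python
import math
import random

def neg2_loglik_normal(xs, mu, sigma2):
    """-2 ln g(tau | theta) for iid N(mu, sigma2) data."""
    n = len(xs)
    rss = sum((x - mu) ** 2 for x in xs)
    return n * math.log(2 * math.pi * sigma2) + rss / sigma2

def bic_zero_mean(xs):
    """BIC for the class N(0, sigma^2): p = 1 free parameter."""
    n = len(xs)
    sigma2_hat = sum(x * x for x in xs) / n
    return neg2_loglik_normal(xs, 0.0, sigma2_hat) + 1 * math.log(n)

def bic_free_mean(xs):
    """BIC for the class N(mu, sigma^2): p = 2 free parameters."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return neg2_loglik_normal(xs, mu_hat, sigma2_hat) + 2 * math.log(n)

rng = random.Random(3)
xs = [rng.gauss(1.0, 1.0) for _ in range(200)]    # hypothetical data with nonzero mean
print(bic_zero_mean(xs), bic_free_mean(xs))
```

Because the simulated data have a mean well away from zero, the improvement in fit from the extra parameter far outweighs the additional ln(n) penalty, and the two-parameter class attains the smaller BIC.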

Although the above Bayesian theory has been presented in an unsupervised learning setting, it can be readily extended to the supervised case. We only need to relabel the training set $\mathcal{T}_n$. In particular, when (as is typical for regression models) the training responses $Y_1, \ldots, Y_n$ are considered as random variables but the corresponding feature vectors $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ are viewed as being fixed, then $\mathcal{T}_n$ is the collection of random responses $\{Y_1, \ldots, Y_n\}$. Alternatively, we can simply identify $\mathcal{T}_n$ with the response vector $\boldsymbol{Y} = [Y_1, \ldots, Y_n]^\top$. We will adopt this notation in the next example.







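■ Example 2.10 (Polynomial Regression (cont.)) Consider Example 2.2 once again, but now in a Bayesian framework, where the prior knowledge on $(\sigma^2, \boldsymbol{\beta})$ is specified by $g(\sigma^2) = 1/\sigma^2$ and $\boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \mathbf{D})$, and $\mathbf{D}$ is a (matrix) hyperparameter. Let $\boldsymbol{\Sigma} := (\mathbf{X}^\top \mathbf{X} + \mathbf{D}^{-1})^{-1}$. Then the posterior can be written as:

$$\begin{aligned}
g(\boldsymbol{\beta}, \sigma^2 \mid \boldsymbol{y}) &= \frac{\exp\left( -\frac{\|\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}\|^2}{2\sigma^2} \right)}{(2\pi\sigma^2)^{n/2}} \times \frac{\exp\left( -\frac{\boldsymbol{\beta}^\top \mathbf{D}^{-1} \boldsymbol{\beta}}{2\sigma^2} \right)}{(2\pi\sigma^2)^{p/2}\, |\mathbf{D}|^{1/2}} \times \frac{1}{\sigma^2} \Big/ g(\boldsymbol{y}) \\
&= \frac{(\sigma^2)^{-(n+p+2)/2}}{(2\pi)^{(n+p)/2}\, |\mathbf{D}|^{1/2}} \exp\left( -\frac{\|\boldsymbol{\Sigma}^{-1/2}(\boldsymbol{\beta} - \overline{\boldsymbol{\beta}})\|^2}{2\sigma^2} - \frac{(n+p+2)\, \overline{\sigma}^2}{2\sigma^2} \right) \Big/ g(\boldsymbol{y}),
\end{aligned}$$

where $\overline{\boldsymbol{\beta}} := \boldsymbol{\Sigma} \mathbf{X}^\top \boldsymbol{y}$ and $\overline{\sigma}^2 := \boldsymbol{y}^\top (\mathbf{I} - \mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^\top) \boldsymbol{y} / (n + p + 2)$ are the MAP estimates of $\boldsymbol{\beta}$ and $\sigma^2$, and $g(\boldsymbol{y})$ is the model evidence for $\mathcal{G}_p$:

$$\begin{aligned}
g(\boldsymbol{y}) &= \iint g(\boldsymbol{\beta}, \sigma^2, \boldsymbol{y}) \, \mathrm{d}\boldsymbol{\beta} \, \mathrm{d}\sigma^2 \\
&= \frac{|\boldsymbol{\Sigma}|^{1/2}}{(2\pi)^{n/2}\, |\mathbf{D}|^{1/2}} \int_0^\infty \frac{\exp\left( -\frac{(n+p+2)\, \overline{\sigma}^2}{2\sigma^2} \right)}{(\sigma^2)^{2/2 + n/2}} \, \mathrm{d}\sigma^2 \\
&= \frac{|\boldsymbol{\Sigma}|^{1/2}\, \Gamma(n/2)}{|\mathbf{D}|^{1/2}\, (\pi (n + p + 2)\, \overline{\sigma}^2)^{n/2}}.
\end{aligned}$$

Therefore, based on (2.49), we have

$$2\, \mathbb{E}\, \ell(g_{\mathcal{T}_n}) \simeq -2 \ln g(\boldsymbol{y}) = n \ln\left[ \pi (n + p + 2)\, \overline{\sigma}^2 \right] - 2 \ln \Gamma(n/2) + \ln |\mathbf{D}| - \ln |\boldsymbol{\Sigma}|.$$

On the other hand, the minus of the log-likelihood of $\boldsymbol{Y}$ can be written as

$$\begin{aligned}
-\ln g(\boldsymbol{y} \mid \boldsymbol{\beta}, \sigma^2) &= \frac{\|\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}\|^2}{2\sigma^2} + \frac{n}{2} \ln(2\pi\sigma^2) \\
&= \frac{\|\boldsymbol{\Sigma}^{-1/2}(\boldsymbol{\beta} - \overline{\boldsymbol{\beta}})\|^2}{2\sigma^2} + \frac{(n+p+2)\, \overline{\sigma}^2}{2\sigma^2} + \frac{n}{2} \ln(2\pi\sigma^2).
\end{aligned}$$

Therefore, the BIC approximation (2.51) is

$$-2 \ln g(\boldsymbol{y} \mid \overline{\boldsymbol{\beta}}, \overline{\sigma}^2) + (p + 1) \ln(n) = n \left[ \ln(2\pi\, \overline{\sigma}^2) + 1 \right] + (p + 1) \ln(n) + (p + 2), \tag{2.52}$$

where the extra $\ln(n)$ term in $(p + 1) \ln(n)$ is due to the inclusion of $\sigma^2$ in $\boldsymbol{\theta} = (\sigma^2, \boldsymbol{\beta})$. Figure 2.15 shows the model evidence and its BIC approximation, where we used a hyperparameter $\mathbf{D} = 10^4 \times \mathbf{I}_p$ for the prior density of $\boldsymbol{\beta}$. We can see that both approximations exhibit a pronounced minimum at $p = 4$, thus identifying the true polynomial regression model. Compare the overall qualitative shape of the cross-entropy risk estimate with the shape of the square-error risk estimate in Figure 2.11. ■

Figure 2.15: The BIC and marginal likelihood used for model selection.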


It is possible to give the model complexity parameter $p$ a Bayesian treatment, in which we define a prior density on the set of all models under consideration. For example, let $g(p)$, $p = 1,\ldots,m$ be a prior density on $m$ candidate models. Treating the model complexity index $p$ as an additional parameter to $\theta \in \mathbb{R}^p$, and applying Bayes' formula, the posterior for $(\theta, p)$ can be written as:

$$
g(\theta, p \mid \tau) = g(\theta \mid p, \tau) \times g(p \mid \tau) = \underbrace{\frac{g(\tau \mid \theta, p)\, g(\theta \mid p)}{g(\tau \mid p)}}_{\text{posterior of } \theta \text{ given model } p} \times \underbrace{\frac{g(\tau \mid p)\, g(p)}{g(\tau)}}_{\text{posterior of model } p}.
$$


The model evidence for a fixed $p$ is now interpreted as the prior predictive density of $\tau$, conditional on the model $p$:

$$
g(\tau \mid p) = \int g(\tau \mid \theta, p)\, g(\theta \mid p)\, d\theta,
$$

and the quantity $g(\tau) = \sum_{p=1}^{m} g(\tau \mid p)\, g(p)$ is interpreted as the marginal likelihood of all the $m$ candidate models. Finally, a simple method for model selection is to pick the index $\hat p$ with the largest posterior probability:

$$
\hat p = \operatorname*{argmax}_{p}\, g(p \mid \tau) = \operatorname*{argmax}_{p}\, g(\tau \mid p)\, g(p).
$$


Example 2.11 (Polynomial Regression (cont.)) Let us revisit Example 2.10 by giving the parameter $p = 1,\ldots,m$, with $m = 10$, a Bayesian treatment. Recall that we used the notation $\tau = y$ in that example. We assume that the prior $g(p) = 1/m$ is flat and uninformative so that the posterior is given by

$$
g(p \mid y) \propto g(y \mid p) = \frac{|\Sigma|^{1/2}\,\Gamma(n/2)}{|D|^{1/2}\,\bigl(\pi (n+p+2)\,\hat\sigma^2\bigr)^{n/2}},
$$

where all quantities in $g(y \mid p)$ are computed using the first $p$ columns of $X$. Figure 2.16 shows the resulting posterior distribution $g(p \mid y)$. The figure also shows the posterior density $\hat g(y \mid p) / \sum_{p=1}^{10} \hat g(y \mid p)$, where

$$
\hat g(y \mid p) := \exp\left(-\frac{n\bigl[\ln(2\pi\hat\sigma^2) + 1\bigr] + (p+1)\ln(n) + (p+2)}{2}\right)
$$

is derived from the BIC approximation (2.52). In both cases, there is a clear maximum at $p = 4$, suggesting that a third-degree polynomial is the most appropriate model for the data.

Figure 2.16: Posterior probabilities for each polynomial model of degree $p - 1$.
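In practice the evidences $g(\tau \mid p)$ are often astronomically small, so the posterior model probabilities are best normalized on the log scale. The following is a minimal sketch of that step; the function name `model_posterior` and the toy log-evidence values are illustrative assumptions, not part of the example above.

```python
import numpy as np

def model_posterior(log_evidence, prior=None):
    """Posterior probabilities g(p | tau) from log evidences ln g(tau | p)."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    m = log_evidence.size
    # A flat prior g(p) = 1/m is used when none is supplied
    prior = np.full(m, 1.0 / m) if prior is None else np.asarray(prior, dtype=float)
    log_post = log_evidence + np.log(prior)
    log_post -= log_post.max()        # shift so the largest term is exp(0) = 1
    post = np.exp(log_post)
    return post / post.sum()

# Toy log evidences (assumption): proportional to 1 : 2 : 1, but far too small
# to exponentiate directly without underflow
log_ev = np.log([1.0, 2.0, 1.0]) - 1000.0
probs = model_posterior(log_ev)
p_hat = int(np.argmax(probs)) + 1     # models are indexed p = 1, ..., m
print(probs, p_hat)
```

Subtracting the maximum before exponentiating leaves the normalized probabilities unchanged, because the common factor cancels in the ratio.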


Suppose that we wish to compare two models, say model $p = 1$ and model $p = 2$. Instead of computing the posterior $g(p \mid \tau)$ explicitly, we can compare the posterior odds ratio:

$$
\frac{g(p = 1 \mid \tau)}{g(p = 2 \mid \tau)} = \frac{g(p = 1)}{g(p = 2)} \times \underbrace{\frac{g(\tau \mid p = 1)}{g(\tau \mid p = 2)}}_{\text{Bayes factor } B_{1|2}}.
$$

This gives rise to the Bayes factor $B_{i|j}$, whose value signifies the strength of the evidence in favor of model $i$ over model $j$. In particular, $B_{i|j} > 1$ means that the evidence in favor of model $i$ is larger.


Example 2.12 (Savage–Dickey Ratio) Suppose that we have two models. Model $p = 2$ has a likelihood $g(\tau \mid \mu, \nu, p = 2)$, depending on two parameters. Model $p = 1$ has the same functional form for the likelihood but now $\nu$ is fixed to some (known) $\nu_0$; that is, $g(\tau \mid \mu, p = 1) = g(\tau \mid \mu, \nu = \nu_0, p = 2)$. We also assume that the prior information on $\mu$ for model 1 is the same as that for model 2, conditioned on $\nu = \nu_0$. That is, we assume $g(\mu \mid p = 1) = g(\mu \mid \nu = \nu_0, p = 2)$. As model 2 contains model 1 as a special case, the latter is said to be nested inside model 2. We can formally write (see also Exercise 26):

$$
\begin{aligned}
g(\tau \mid p = 1) &= \int g(\tau \mid \mu, p = 1)\, g(\mu \mid p = 1)\, d\mu \\
&= \int g(\tau \mid \mu, \nu = \nu_0, p = 2)\, g(\mu \mid \nu = \nu_0, p = 2)\, d\mu \\
&= g(\tau \mid \nu = \nu_0, p = 2) = \frac{g(\tau, \nu = \nu_0 \mid p = 2)}{g(\nu = \nu_0 \mid p = 2)}.
\end{aligned}
$$

Hence, the Bayes factor simplifies to

$$
B_{1|2} = \frac{g(\tau \mid p = 1)}{g(\tau \mid p = 2)} = \frac{g(\tau, \nu = \nu_0 \mid p = 2)}{g(\nu = \nu_0 \mid p = 2)} \Big/ g(\tau \mid p = 2) = \frac{g(\nu = \nu_0 \mid \tau, p = 2)}{g(\nu = \nu_0 \mid p = 2)}.
$$

In other words, $B_{1|2}$ is the ratio of the posterior density to the prior density of $\nu$, evaluated at $\nu = \nu_0$ and both under the unrestricted model $p = 2$. This ratio of posterior to prior densities is called the Savage–Dickey density ratio.
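The identity can be checked numerically on a toy conjugate model. The setup below is an assumption made for illustration only: there is no nuisance parameter $\mu$, the data satisfy $x_i \mid \nu \sim \mathcal{N}(\nu, 1)$ with prior $\nu \sim \mathcal{N}(0, 1)$ under model 2, and model 1 fixes $\nu = \nu_0 = 0$. In this case both evidences and the posterior of $\nu$ are available in closed form.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
n, nu0 = 8, 0.0
x = rng.normal(0.3, 1.0, size=n)      # synthetic data (assumption)

# Direct route: ratio of the two model evidences
ev1 = np.prod(norm.pdf(x, loc=nu0, scale=1.0))             # g(tau | p = 1)
# Under model 2, marginally x ~ N(0, I + 1 1^T)
ev2 = multivariate_normal.pdf(x, mean=np.zeros(n),
                              cov=np.eye(n) + np.ones((n, n)))
bf_direct = ev1 / ev2

# Savage-Dickey route: posterior over prior density of nu at nu0 (model 2)
post_mean = x.sum() / (n + 1)         # posterior of nu is N(sum(x)/(n+1), 1/(n+1))
post_sd = 1.0 / np.sqrt(n + 1)
bf_sd = norm.pdf(nu0, post_mean, post_sd) / norm.pdf(nu0, 0.0, 1.0)

print(bf_direct, bf_sd)               # the two numbers should agree
```

The Savage–Dickey route only needs the posterior of $\nu$ under the larger model, which is convenient when the evidence of the nested model is awkward to compute directly.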


Whether to use a classical (frequentist) or Bayesian model is largely a question of con- 
venience. Classical inference is useful because it comes with a huge repository of ready- 
to-use results, and requires no (subjective) prior information on the parameters. Bayesian 
models are useful because the whole theory is based on the elegant Bayes’ formula, and 
uncertainty in the inference (e.g., confidence intervals) can be quantified much more nat- 
urally (e.g., credible intervals). A usual practice is to “Bayesify” a classical model, simply 
by adding some prior information on the parameters. 


Further Reading 


A popular textbook on statistical learning is [55]. Accessible treatments of mathematical 
statistics can be found, for example, in [69], [74], and [124]. More advanced treatments 
are given in [10], [25], and [78]. A good overview of modern-day statistical inference 
is given in [36]. Classical references on pattern classification and machine learning are 
[12] and [35]. For advanced learning theory including information theory and Rademacher 
complexity, we refer to [28] and [109]. An applied reference for Bayesian inference is [46]. 
For a survey of numerical techniques relevant to computational statistics, see [90]. 


Exercises 


1. Suppose that the loss function is the piecewise linear function

$$
\mathrm{Loss}(y, \hat y) = \alpha\,(\hat y - y)_+ + \beta\,(y - \hat y)_+, \quad \alpha, \beta > 0,
$$

where $c_+$ is equal to $c$ if $c > 0$, and zero otherwise. Show that the minimizer of the risk $\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(X))$ satisfies

$$
\mathbb{P}[Y < g^*(x) \mid X = x] = \frac{\beta}{\alpha + \beta}.
$$

In other words, $g^*(x)$ is the $\beta/(\alpha + \beta)$ quantile of $Y$, conditional on $X = x$.

2. Show that, for the squared-error loss, the approximation error $\ell(g^{\mathcal{G}}) - \ell(g^*)$ in (2.16) is equal to $\mathbb{E}(g^{\mathcal{G}}(X) - g^*(X))^2$. [Hint: expand $\ell(g^{\mathcal{G}}) = \mathbb{E}(Y - g^*(X) + g^*(X) - g^{\mathcal{G}}(X))^2$.]


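3. Suppose $\mathcal{G}$ is the class of linear functions. A linear function evaluated at a feature $x$ can be described as $g(x) = \beta^\top x$ for some parameter vector $\beta$ of appropriate dimension. Denote $g^{\mathcal{G}}(x) = x^\top \beta^{\mathcal{G}}$ and $g^{\mathcal{G}}_{\tau}(x) = x^\top \hat\beta$. Show that

$$
\mathbb{E}\bigl(g^{\mathcal{G}}_{\tau}(X) - g^*(X)\bigr)^2 = \mathbb{E}\bigl(X^\top \hat\beta - X^\top \beta^{\mathcal{G}}\bigr)^2 + \mathbb{E}\bigl(X^\top \beta^{\mathcal{G}} - g^*(X)\bigr)^2.
$$

Hence, deduce that the statistical error in (2.16) is $\ell(g^{\mathcal{G}}_{\tau}) - \ell(g^{\mathcal{G}}) = \mathbb{E}\bigl(g^{\mathcal{G}}_{\tau}(X) - g^{\mathcal{G}}(X)\bigr)^2$.

4. Show that formula (2.24) holds for the 0-1 loss with 0-1 response.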


5. Let $X$ be an $n$-dimensional normal random vector with mean vector $\mu$ and covariance matrix $\Sigma$, where the determinant of $\Sigma$ is non-zero. Show that $X$ has joint probability density

$$
f(x) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma|}}\, e^{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)}, \quad x \in \mathbb{R}^n.
$$

6. Let $\hat\beta = A^+ y$. Using the defining properties of the pseudo-inverse, show that for any $\beta \in \mathbb{R}^p$,

$$
\|A\hat\beta - y\| \le \|A\beta - y\|.
$$


7. Suppose that in the polynomial regression Example 2.1 we select the linear class of functions $\mathcal{G}_p$ with $p \ge 4$. Then, $g^* \in \mathcal{G}_p$ and the approximation error is zero, because $g^{\mathcal{G}_p}(x) = g^*(x) = x^\top \beta$, where $\beta = [10, -140, 400, -250, 0, \ldots, 0]^\top \in \mathbb{R}^p$. Use the tower property to show that the learner $g_\tau(x) = x^\top \hat\beta$ with $\hat\beta = X^+ y$, assuming $\operatorname{rank}(X) \ge 4$, is unbiased:

$$
\mathbb{E}\, g_{\mathcal{T}}(x) = g^*(x).
$$

8. (Exercise 7 continued.) Observe that the learner $g_{\mathcal{T}}$ can be written as a linear combination of the response variable: $g_{\mathcal{T}}(x) = x^\top X^+ Y$. Prove that for any learner of the form $x^\top A y$, where $A \in \mathbb{R}^{p \times n}$ is some matrix that satisfies $\mathbb{E}_X[x^\top A Y] = g^*(x)$, we have

$$
\mathbb{V}\mathrm{ar}_X[x^\top X^+ Y] \le \mathbb{V}\mathrm{ar}_X[x^\top A Y],
$$

where the equality is achieved for $A = X^+$. This is called the Gauss–Markov inequality. Hence, using the Gauss–Markov inequality, deduce that for the unconditional variance:

$$
\mathbb{V}\mathrm{ar}\, g_{\mathcal{T}}(x) \le \mathbb{V}\mathrm{ar}[x^\top A Y].
$$

Deduce that $A = X^+$ also minimizes the expected generalization risk.


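9. Consider again the polynomial regression Example 2.1. Use the fact that $\mathbb{E}_X \hat\beta = X^+ h^*(u)$, where $h^*(u) = \mathbb{E}[Y \mid U = u] = [h^*(u_1), \ldots, h^*(u_n)]^\top$, to show that the expected in-sample risk is:

$$
\mathbb{E}_X\, \ell_{\mathrm{in}}(g_{\mathcal{T}}) = \ell^* + \frac{\|h^*(u)\|^2 - \|X X^+ h^*(u)\|^2}{n} + \frac{\ell^* p}{n}.
$$

Also, use Theorem C.2 to show that the expected statistical error is:

$$
\mathbb{E}_X (\hat\beta - \beta_p)^\top H_p (\hat\beta - \beta_p) = \ell^*\, \mathrm{tr}\bigl(X^+ (X^+)^\top H_p\bigr) + (X^+ h^*(u) - \beta_p)^\top H_p\, (X^+ h^*(u) - \beta_p).
$$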
10. Consider the setting of the polynomial regression in Example 2.2. Use Theorem C.19 to prove that

$$
\sqrt{n}\,(\hat\beta_n - \beta_p) \xrightarrow{d} \mathcal{N}\bigl(0,\; \ell^* H_p^{-1} + H_p^{-1} M_p H_p^{-1}\bigr), \tag{2.53}
$$

where $M_p := \mathbb{E}[X X^\top (g^*(X) - g^{\mathcal{G}_p}(X))^2]$ is the matrix with $(i, j)$-th entry:

$$
\int_0^1 u^{i+j-2}\, \bigl(h^{\mathcal{H}_p}(u) - h^*(u)\bigr)^2\, du,
$$

and $H_p^{-1}$ is the $p \times p$ inverse Hilbert matrix with $(i, j)$-th entry:

$$
(-1)^{i+j}\,(i + j - 1) \binom{p + i - 1}{p - j} \binom{p + j - 1}{p - i} \binom{i + j - 2}{i - 1}^2.
$$

Observe that $M_p = 0$ for $p \ge 4$, so that the matrix $M_p$ term is due to choosing a restrictive class $\mathcal{G}_p$ that does not contain the true prediction function.


11. In Example 2.2 we saw that the statistical error can be expressed (see (2.20)) as

$$
\int_0^1 \bigl([1, \ldots, u^{p-1}](\hat\beta - \beta_p)\bigr)^2\, du = (\hat\beta - \beta_p)^\top H_p\, (\hat\beta - \beta_p).
$$

By Exercise 10 the random vector $Z_n := \sqrt{n}\,(\hat\beta_n - \beta_p)$ has asymptotically a multivariate normal distribution with mean vector $0$ and covariance matrix $V := \ell^* H_p^{-1} + H_p^{-1} M_p H_p^{-1}$. Use Theorem C.2 to show that the expected statistical error is asymptotically

$$
\mathbb{E}\,(\hat\beta - \beta_p)^\top H_p\, (\hat\beta - \beta_p) \simeq \frac{\ell^* p}{n} + \frac{\mathrm{tr}(M_p H_p^{-1})}{n}, \quad n \to \infty. \tag{2.54}
$$

Plot this large-sample approximation of the expected statistical error and compare it with the outcome of the statistical error.

We note a subtle technical detail: In general, convergence in distribution does not imply convergence in $L_p$-norm (see Example C.6), and so here we have implicitly assumed that

$$
\|Z_n\| \xrightarrow{d} \text{Dist.} \;\Longrightarrow\; \|Z_n\| \xrightarrow{L_2} \text{constant} := \lim_{n \uparrow \infty} \mathbb{E}\,\|Z_n\|.
$$


12. Consider again Example 2.2. The result in (2.53) suggests that $\mathbb{E}\hat\beta \to \beta_p$ as $n \to \infty$, where $\beta_p$ is the solution in the class $\mathcal{G}_p$ given in (2.18). Thus, the large-sample approximation of the pointwise bias of the learner $g^{\mathcal{G}_p}_{\mathcal{T}}(x) = x^\top \hat\beta$ at $x = [1, \ldots, u^{p-1}]^\top$ is

$$
\mathbb{E}\, g^{\mathcal{G}_p}_{\mathcal{T}}(x) - g^*(x) \simeq [1, \ldots, u^{p-1}]\,\beta_p - [1, u, u^2, u^3]\,\beta^*, \quad n \to \infty.
$$

Use Python to reproduce Figure 2.17, which shows the (large-sample) pointwise squared bias of the learner for $p \in \{1, 2, 3\}$. Note how the bias is larger near the endpoints $u = 0$ and $u = 1$. Explain why the areas under the curves correspond to the approximation errors.

Figure 2.17: The large-sample pointwise squared bias of the learner for $p = 1, 2, 3$. The bias is zero for $p \ge 4$.


13. For our running Example 2.2 we can use (2.53) to derive a large-sample approximation of the pointwise variance of the learner $g_{\mathcal{T}}(x) = x^\top \hat\beta_n$. In particular, show that for large $n$

$$
\mathbb{V}\mathrm{ar}\, g_{\mathcal{T}}(x) \simeq \frac{\ell^*\, x^\top H_p^{-1} x}{n} + \frac{x^\top H_p^{-1} M_p H_p^{-1} x}{n}, \quad n \to \infty. \tag{2.55}
$$

Figure 2.18 shows this (large-sample) variance of the learner for different values of the predictor $u$ and model index $p$. Observe that the variance ultimately increases in $p$ and that it is smaller at $u = 1/2$ than closer to the endpoints $u = 0$ or $u = 1$. Since the bias is also larger near the endpoints, we deduce that the pointwise mean squared error (2.21) is larger near the endpoints of the interval $[0, 1]$ than near its middle. In other words, the error is much smaller in the center of the data cloud than near its periphery.

Figure 2.18: The pointwise variance of the learner for various pairs of $p$ and $u$.





14. Let $h : x \mapsto \mathbb{R}$ be a convex function and let $X$ be a random variable. Use the subgradient definition of convexity to prove Jensen's inequality:

$$
\mathbb{E}\, h(X) \ge h(\mathbb{E} X). \tag{2.56}
$$

15. Using Jensen's inequality, show that the Kullback–Leibler divergence between probability densities $f$ and $g$ is always non-negative; that is,

$$
\mathbb{E} \ln \frac{f(X)}{g(X)} \ge 0,
$$

where $X \sim f$.
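16. The purpose of this exercise is to prove the following Vapnik–Chervonenkis bound: for any finite class $\mathcal{G}$ (containing only a finite number $|\mathcal{G}|$ of possible functions) and a general bounded loss function, $l \le \mathrm{Loss} \le u$, the expected statistical error is bounded from above according to:

$$
\mathbb{E}\, \ell(g^{\mathcal{G}}_{\mathcal{T}_n}) - \ell(g^{\mathcal{G}}) \le \frac{(u - l)\sqrt{2 \ln(2|\mathcal{G}|)}}{\sqrt{n}}. \tag{2.57}
$$

Note how this bound conveniently does not depend on the distribution of the training set $\mathcal{T}_n$ (which is typically unknown), but only on the complexity (i.e., cardinality) of the class $\mathcal{G}$. We can break up the proof of (2.57) into the following four parts:

(a) For a general function class $\mathcal{G}$, training set $\mathcal{T}$, risk function $\ell$, and training loss $\ell_{\mathcal{T}}$, we have, by definition, $\ell(g^{\mathcal{G}}) \le \ell(g)$ and $\ell_{\mathcal{T}}(g^{\mathcal{G}}_{\mathcal{T}}) \le \ell_{\mathcal{T}}(g)$ for all $g \in \mathcal{G}$. Show that

$$
\ell(g^{\mathcal{G}}_{\mathcal{T}}) - \ell(g^{\mathcal{G}}) \le \sup_{g \in \mathcal{G}} |\ell_{\mathcal{T}}(g) - \ell(g)| + \ell_{\mathcal{T}}(g^{\mathcal{G}}) - \ell(g^{\mathcal{G}}),
$$

where we used the notation sup (supremum) for the least upper bound. Since $\mathbb{E}\,\ell_{\mathcal{T}}(g) = \mathbb{E}\,\ell(g)$, we obtain, after taking expectations on both sides of the inequality above:

$$
\mathbb{E}\, \ell(g^{\mathcal{G}}_{\mathcal{T}}) - \ell(g^{\mathcal{G}}) \le \mathbb{E} \sup_{g \in \mathcal{G}} |\ell_{\mathcal{T}}(g) - \ell(g)|.
$$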
(b) If $X$ is a zero-mean random variable taking values in the interval $[l, u]$, then the following Hoeffding's inequality states that the moment generating function satisfies

$$
\mathbb{E}\, e^{tX} \le \exp\left(\frac{t^2 (u - l)^2}{8}\right), \quad t \in \mathbb{R}. \tag{2.58}
$$

Prove this result by using the fact that the line segment joining points $(l, \exp(tl))$ and $(u, \exp(tu))$ bounds the convex function $x \mapsto \exp(tx)$ for $x \in [l, u]$; that is:

$$
e^{tx} \le e^{tl}\, \frac{u - x}{u - l} + e^{tu}\, \frac{x - l}{u - l}, \quad x \in [l, u].
$$

(c) Let $Z_1, \ldots, Z_n$ be (possibly dependent and non-identically distributed) zero-mean random variables with moment generating functions that satisfy $\mathbb{E} \exp(t Z_k) \le \exp(t^2 \eta^2 / 2)$ for all $k$ and some parameter $\eta$. Use Jensen's inequality (2.56) to prove that for any $t > 0$,

$$
\mathbb{E} \max_k Z_k = \frac{1}{t}\, \mathbb{E} \ln \max_k e^{t Z_k} \le \frac{1}{t} \ln n + \frac{t \eta^2}{2}.
$$

From this derive that

$$
\mathbb{E} \max_k Z_k \le \eta \sqrt{2 \ln n}.
$$

Finally, show that this last inequality implies that

$$
\mathbb{E} \max_k |Z_k| \le \eta \sqrt{2 \ln(2n)}. \tag{2.59}
$$

(d) Returning to the objective of this exercise, denote the elements of $\mathcal{G}$ by $g_1, \ldots, g_{|\mathcal{G}|}$, and let $Z_k = \ell_{\mathcal{T}_n}(g_k) - \ell(g_k)$. By part (a) it is sufficient to bound $\mathbb{E} \max_k |Z_k|$. Show that the $\{Z_k\}$ satisfy the conditions of (c) with $\eta = (u - l)/\sqrt{n}$. For this you will need to apply part (b) to the random variable $\mathrm{Loss}(g(X), Y) - \ell(g)$, where $(X, Y)$ is a generic data point. Now complete the proof of (2.57).


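17. Consider the problem in Exercise 16a above. Show that

$$
|\ell_{\mathcal{T}}(g^{\mathcal{G}}_{\mathcal{T}}) - \ell(g^{\mathcal{G}})| \le 2 \sup_{g \in \mathcal{G}} |\ell_{\mathcal{T}}(g) - \ell(g)| + \ell_{\mathcal{T}}(g^{\mathcal{G}}) - \ell(g^{\mathcal{G}}).
$$

From this, conclude:

$$
\mathbb{E}\, |\ell_{\mathcal{T}}(g^{\mathcal{G}}_{\mathcal{T}}) - \ell(g^{\mathcal{G}})| \le 2\, \mathbb{E} \sup_{g \in \mathcal{G}} |\ell_{\mathcal{T}}(g) - \ell(g)|.
$$

The last bound allows us to assess how close the training loss $\ell_{\mathcal{T}}(g^{\mathcal{G}}_{\mathcal{T}})$ is to the optimal risk $\ell(g^{\mathcal{G}})$ within class $\mathcal{G}$.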


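18. Show that for the normal linear model $Y \sim \mathcal{N}(X\beta, \sigma^2 I_n)$, the maximum likelihood estimator of $\sigma^2$ is identical to the method of moments estimator (2.37).

19. Let $X \sim \mathsf{Gamma}(\alpha, \lambda)$. Show that the pdf of $Z = 1/X$ is equal to

$$
\frac{\lambda^\alpha\, z^{-\alpha - 1}\, e^{-\lambda z^{-1}}}{\Gamma(\alpha)}, \quad z > 0.
$$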





20. Consider the sequence $w_0, w_1, \ldots$, where $w_0 = g(\theta)$ is a non-degenerate initial guess and $w_t(\theta) \propto w_{t-1}(\theta)\, g(\tau \mid \theta)$, $t \ge 1$. We assume that $g(\tau \mid \theta)$ is not the constant function (with respect to $\theta$) and that the maximum likelihood value

$$
g(\tau \mid \hat\theta) = \max_\theta g(\tau \mid \theta) < \infty
$$

exists (is bounded). Let

$$
\ell_t := \int g(\tau \mid \theta)\, w_t(\theta)\, d\theta.
$$

Show that $\{\ell_t\}$ is a strictly increasing and bounded sequence. Hence, conclude that its limit is $g(\tau \mid \hat\theta)$.

21. Consider the Bayesian model for $\tau = \{x_1, \ldots, x_n\}$ with likelihood $g(\tau \mid \mu)$ such that $(X_1, \ldots, X_n \mid \mu) \sim_{\text{iid}} \mathcal{N}(\mu, 1)$ and prior pdf $g(\mu)$ such that $\mu \sim \mathcal{N}(\nu, 1)$ for some hyperparameter $\nu$. Define a sequence of densities $w_t(\mu)$, $t \ge 2$ via $w_t(\mu) \propto w_{t-1}(\mu)\, g(\tau \mid \mu)$, starting with $w_1(\mu) = g(\mu)$. Let $a_t$ and $b_t$ denote the mean and precision⁴ of $\mu$ under the posterior $g_t(\mu \mid \tau) \propto g(\tau \mid \mu)\, w_t(\mu)$. Show that $g_t(\mu \mid \tau)$ is a normal density with precision $b_t = b_{t-1} + n$, $b_0 = 1$ and mean $a_t = (1 - \gamma_t)\, a_{t-1} + \gamma_t\, \bar{x}_n$, $a_0 = \nu$, where $\gamma_t := n/(b_{t-1} + n)$. Hence, deduce that $g_t(\mu \mid \tau)$ converges to a degenerate density with a point-mass at $\bar{x}_n$.


22. Consider again Example 2.8, where we have a normal model with improper prior $g(\theta) = g(\mu, \sigma^2) \propto 1/\sigma^2$. Show that the prior predictive pdf is an improper density $g(x) \propto 1$, but that the posterior predictive density is

$$
g(x \mid \tau) \propto \left(1 + \frac{(x - \bar{x}_n)^2}{(n + 1)\, S_n^2}\right)^{-n/2}.
$$

Deduce that

$$
\frac{X - \bar{x}_n}{S_n \sqrt{(n + 1)/(n - 1)}} \sim \mathsf{t}_{n-1}.
$$


23. Assuming that $X_1, \ldots, X_n \sim_{\text{iid}} f$, show that (2.48) holds and that $\ell^*_n = -n\, \mathbb{E} \ln f(X)$.


24. Suppose that $\tau = \{x_1, \ldots, x_n\}$ are observations of iid continuous and strictly positive random variables, and that there are two possible models for their pdf. The first model $p = 1$ is

$$
g(x \mid \theta, p = 1) = \theta \exp(-\theta x)
$$

and the second $p = 2$ is

$$
g(x \mid \theta, p = 2) = \left(\frac{2\theta}{\pi}\right)^{1/2} \exp\left(-\frac{\theta x^2}{2}\right).
$$

For both models, assume that the prior for $\theta$ is a gamma density

$$
g(\theta) = \frac{b^t}{\Gamma(t)}\, \theta^{t-1} \exp(-b\theta),
$$

with the same hyperparameters $b$ and $t$. Find a formula for the Bayes factor, $g(\tau \mid p = 1)/g(\tau \mid p = 2)$, for comparing these models.


25. Suppose that we have a total of $m$ possible models with prior probabilities $g(p)$, $p = 1, \ldots, m$. Show that the posterior probability of model $g(p \mid \tau)$ can be expressed in terms of all the $p(p - 1)$ Bayes factors:

$$
g(p = i \mid \tau) = \left(1 + \sum_{j \ne i} \frac{g(p = j)}{g(p = i)}\, B_{j|i}\right)^{-1}.
$$

⁴The precision is the reciprocal of the variance.
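26. Given the data $\tau = \{x_1, \ldots, x_n\}$, suppose that we use the likelihood $(X \mid \theta) \sim \mathcal{N}(\mu, \sigma^2)$ with parameter $\theta = (\mu, \sigma^2)^\top$ and wish to compare the following two nested models.

(a) Model $p = 1$, where $\sigma^2 = \sigma_0^2$ is known and this is incorporated via the prior

$$
g(\theta \mid p = 1) = g(\mu \mid \sigma^2, p = 1)\, g(\sigma^2 \mid p = 1) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\mu - x_0)^2}{2\sigma^2}} \times \delta(\sigma^2 - \sigma_0^2).
$$

(b) Model $p = 2$, where both mean and variance are unknown with prior

$$
g(\theta \mid p = 2) = g(\mu \mid \sigma^2)\, g(\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\mu - x_0)^2}{2\sigma^2}} \times \frac{b^t\, (\sigma^2)^{-t-1}\, e^{-b/\sigma^2}}{\Gamma(t)}.
$$

Show that the prior $g(\theta \mid p = 1)$ can be viewed as the limit of the prior $g(\theta \mid p = 2)$ when $t \to \infty$ and $b = t\sigma_0^2$. Hence, conclude that

$$
g(\tau \mid p = 1) = \lim_{\substack{t \to \infty \\ b = t\sigma_0^2}} g(\tau \mid p = 2)
$$

and use this result to calculate $B_{1|2}$. Check that the formula for $B_{1|2}$ agrees with the Savage–Dickey density ratio:

$$
\frac{g(\tau \mid p = 1)}{g(\tau \mid p = 2)} = \frac{g(\sigma^2 = \sigma_0^2 \mid \tau)}{g(\sigma^2 = \sigma_0^2)},
$$

where $g(\sigma^2 \mid \tau)$ and $g(\sigma^2)$ are the posterior and prior, respectively, under model $p = 2$.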
CHAPTER 3





MONTE CARLO METHODS 





Many algorithms in machine learning and data science make use of Monte Carlo 
techniques. This chapter gives an introduction to the three main uses of Monte Carlo 
simulation: to (1) simulate random objects and processes in order to observe their beha- 
vior, (2) estimate numerical quantities by repeated sampling, and (3) solve complicated 
optimization problems through randomized algorithms. 


3.1 Introduction 


Briefly put, Monte Carlo simulation is the generation of random data by means of a com- 
puter. These data could arise from simple models, such as those described in Chapter 2, 
or from very complicated models describing real-life systems, such as the positions of 
vehicles on a complex road network, or the evolution of security prices in the stock mar- 
ket. In many cases, Monte Carlo simulation simply involves random sampling from certain 
probability distributions. The idea is to repeat the random experiment that is described by 
the model many times to obtain a large quantity of data that can be used to answer questions 
about the model. The three main uses of Monte Carlo simulation are: 


Sampling. Here the objective is to gather information about a random object by observing 
many realizations of it. For instance, this could be a random process that mimics the 
behavior of some real-life system such as a production line or telecommunications 
network. Another usage is found in Bayesian statistics, where Markov chains are 
often used to sample from a posterior distribution. 


Estimation. In this case the emphasis is on estimating certain numerical quantities related 
to a simulation model. An example is the evaluation of multidimensional integrals 
via Monte Carlo techniques. This is achieved by writing the integral as the expecta- 
tion of a random variable, which is then approximated by the sample mean. Appeal- 
ing to the Law of Large Numbers guarantees that this approximation will eventually 
converge when the sample size becomes large. 


Optimization. Monte Carlo simulation is a powerful tool for the optimization of complic- 
ated objective functions. In many applications these functions are deterministic and 
randomness is introduced artificially in order to more efficiently search the domain of
the objective function. Monte Carlo techniques are also used to optimize noisy func- 
tions, where the function itself is random; for example, when the objective function 
is the output of a Monte Carlo simulation. 
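As a small illustration of the estimation use case, the integral $\int_0^1 \sqrt{1 - u^2}\, du = \pi/4$ can be written as the expectation $\mathbb{E}\sqrt{1 - U^2}$ with $U \sim \mathcal{U}(0, 1)$ and approximated by a sample mean. The integrand, sample size, and seed below are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2024)
N = 100_000
u = rng.random(N)                     # U_1, ..., U_N iid uniform on [0, 1)
h = np.sqrt(1.0 - u**2)               # integrand evaluated at the samples

estimate = h.mean()                   # sample mean approximates the integral
std_error = h.std(ddof=1) / np.sqrt(N)
print(f"estimate = {estimate:.4f} +/- {1.96 * std_error:.4f}, "
      f"exact = {np.pi / 4:.4f}")
```

The reported half-width is an approximate 95% confidence interval based on the central limit theorem; it shrinks at rate $1/\sqrt{N}$ regardless of the dimension of the integral.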


The Monte Carlo method dramatically changed the way in which statistics is used in 
today’s analysis of data. The ever-increasing complexity of data requires radically different 
statistical models and analysis techniques from those that were used 20 to 100 years ago. 
By using Monte Carlo techniques, the data analyst is no longer restricted to using basic 
(and often inappropriate) models to describe data. Now, any probabilistic model that can 
be simulated on a computer can serve as the basis for statistical analysis. This Monte Carlo 
revolution has had an impact on both Bayesian and frequentist statistics. In particular, in 
frequentist statistics, Monte Carlo methods are often referred to as resampling techniques. 
An important example is the well-known bootstrap method [37], where statistical quantit- 
ies such as confidence intervals and P-values for statistical tests can simply be determined 
by simulation without the need of a sophisticated analysis of the underlying probability 
distributions; see, for example, [69] for basic applications. The impact on Bayesian statist- 
ics has been even more profound, through the use of Markov chain Monte Carlo (MCMC) 
techniques [87, 48]. MCMC samplers construct a Markov process which converges in dis- 
tribution to a desired (often high-dimensional) density. This convergence in distribution 
justifies using a finite run of the Markov process as an approximate random realization 
from the target density. The MCMC approach has rapidly gained popularity as a versatile heuristic approximation, partly due to its simple computer implementation and inbuilt mechanism to trade off computational cost against accuracy; namely, the longer one runs the Markov process, the better the approximation. Nowadays, MCMC methods are
indispensable for analyzing posterior distributions for inference and model selection; see 
also [50, 99]. 

The following three sections elaborate on these three uses of Monte Carlo simulation 
in turn. 


3.2 Monte Carlo Sampling 


In this section we describe a variety of Monte Carlo sampling methods, from the building 
block of simulating uniform random numbers to MCMC samplers. 


3.2.1 Generating Random Numbers 


At the heart of any Monte Carlo method is a random number generator: a procedure that produces a stream of uniform random numbers on the interval (0,1). Since such numbers are usually produced via deterministic algorithms, they are not truly random. However, for most applications all that is required is that such pseudo-random numbers are statistically indistinguishable from genuine random numbers $U_1, U_2, \ldots$ that are uniformly distributed on the interval (0,1) and are independent of each other; we write $U_1, U_2, \ldots \sim_{\text{iid}} \mathcal{U}(0, 1)$. For example, in Python the rand method of the numpy.random module is widely used for this purpose.
Most random number generators at present are based on linear recurrence relations. One of the most important random number generators is the multiple-recursive generator (MRG) of order $k$, which generates a sequence of integers $X_k, X_{k+1}, \ldots$ via the linear recurrence

$$
X_t = (a_1 X_{t-1} + \cdots + a_k X_{t-k}) \bmod m, \quad t = k, k+1, \ldots \tag{3.1}
$$

for some modulus $m$ and multipliers $\{a_i, i = 1, \ldots, k\}$. Here "mod" refers to the modulo operation: $n \bmod m$ is the remainder when $n$ is divided by $m$. The recurrence is initialized by specifying $k$ "seeds", $X_0, \ldots, X_{k-1}$. To yield fast algorithms, all but a few of the multipliers should be 0. When $m$ is a large integer, one can obtain a stream of pseudo-random numbers $U_k, U_{k+1}, \ldots$ between 0 and 1 from the sequence $X_k, X_{k+1}, \ldots$, simply by setting $U_t = X_t/m$. It is also possible to set a small modulus, in particular $m = 2$. The output function for such modulo 2 generators is then typically of the form

$$
U_t = \sum_{i=1}^{w} X_{tw + i - 1}\, 2^{-i}
$$

for some $w \le k$, e.g., $w = 32$ or 64. Examples of modulo 2 generators are the feedback shift register generators, the most popular of which are the Mersenne twisters; see, for example, [79] and [83]. MRGs with excellent statistical properties can be implemented efficiently by combining several simpler MRGs and carefully choosing their respective moduli and multipliers. One of the most successful is L'Ecuyer's MRG32k3a generator; see [77]. From now on, we assume that the reader has a sound random number generator available.
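The recurrence (3.1) is short enough to write out directly. The sketch below is for illustration only: the function names are hypothetical, and the parameters used in the demonstration are the order-1 special case with multiplier 16807 and modulus $2^{31} - 1$ (the classic Lehmer "minimal standard" generator), which is far too weak for serious simulation work.

```python
from itertools import islice

def mrg_integers(multipliers, modulus, seeds):
    """Yield X_k, X_{k+1}, ... from the linear recurrence (3.1)."""
    k = len(multipliers)
    if len(seeds) != k:
        raise ValueError("an order-k generator needs exactly k seeds")
    state = list(seeds)               # holds [X_{t-k}, ..., X_{t-1}]
    while True:
        # X_t = (a_1 X_{t-1} + ... + a_k X_{t-k}) mod m
        x = sum(a * s for a, s in zip(multipliers, reversed(state))) % modulus
        state = state[1:] + [x]       # slide the window forward by one step
        yield x

def mrg_uniforms(multipliers, modulus, seeds):
    """Yield U_t = X_t / m, a stream of pseudo-random numbers in [0, 1)."""
    for x in mrg_integers(multipliers, modulus, seeds):
        yield x / modulus

m = 2**31 - 1
ints = list(islice(mrg_integers([16807], m, [1]), 3))
unifs = list(islice(mrg_uniforms([16807], m, [1]), 3))
print(ints)
print(unifs)
```

Python integers have arbitrary precision, so the products in the recurrence cannot overflow; an implementation in a fixed-width language would need extra care at this step.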


3.2.2 Simulating Random Variables 


Simulating a random variable $X$ from an arbitrary (that is, not necessarily uniform) distribution invariably involves the following two steps:

1. Simulate uniform random numbers $U_1, \ldots, U_k$ on (0, 1) for some $k = 1, 2, \ldots$.

2. Return $X = g(U_1, \ldots, U_k)$, where $g$ is some real-valued function.

The construction of suitable functions $g$ is as much of an art as a science. Many simulation methods may be found, for example, in [71] and the accompanying website www.montecarlohandbook.org. Two of the most useful general procedures for generating random variables are the inverse-transform method and the acceptance–rejection method. Before we discuss these, we show one possible way to simulate standard normal random variables. In Python we can generate standard normal random variables via the randn method of the numpy.random module.


Example 3.1 (Simulating Standard Normal Random Variables) If $X$ and $Y$ are independent standard normally distributed random variables (that is, $X, Y \sim_{\text{iid}} \mathcal{N}(0, 1)$), then their joint pdf is

$$
f(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}, \quad (x, y) \in \mathbb{R}^2,
$$

which is a radially symmetric function. In Example C.2 we see that, in polar coordinates, the angle $\Theta$ that the random vector $[X, Y]^\top$ makes with the positive $x$-axis is $\mathcal{U}(0, 2\pi)$ distributed (as would be expected from the radial symmetry) and the radius $R$ has pdf $f_R(r) = r\, e^{-r^2/2}$, $r > 0$. Moreover, $R$ and $\Theta$ are independent. We will see shortly, in Example 3.4, that $R$ has the same distribution as $\sqrt{-2 \ln U}$ with $U \sim \mathcal{U}(0, 1)$. So, to simulate $X, Y \sim_{\text{iid}} \mathcal{N}(0, 1)$, the idea is to first simulate $R$ and $\Theta$ independently and then return $X = R\cos(\Theta)$ and $Y = R\sin(\Theta)$ as a pair of independent standard normal random variables. This leads to the Box–Muller approach for generating standard normal random variables.


Algorithm 3.2.1: Normal Random Variable Simulation: Box–Muller Approach
output: Independent standard normal random variables $X$ and $Y$.

1 Simulate two independent random variables, $U_1$ and $U_2$, from $\mathcal{U}(0, 1)$.

2 $X \leftarrow (-2 \ln U_1)^{1/2} \cos(2\pi U_2)$

3 $Y \leftarrow (-2 \ln U_1)^{1/2} \sin(2\pi U_2)$

4 return $X, Y$
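A vectorized sketch of the Box–Muller steps might look as follows. The function name `box_muller`, the seed, and the sample size are illustrative choices; the uniform draws come from NumPy's `Generator.random`, which returns values in $[0, 1)$, so the first uniform is reflected to $(0, 1]$ to avoid taking the logarithm of zero.

```python
import numpy as np

def box_muller(n, rng):
    """Return two arrays of n independent standard normal draws each."""
    u1 = 1.0 - rng.random(n)          # in (0, 1], so log(u1) is finite
    u2 = rng.random(n)
    r = np.sqrt(-2.0 * np.log(u1))    # radius R
    theta = 2.0 * np.pi * u2          # angle Theta, uniform on [0, 2*pi)
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(12345)
x, y = box_muller(100_000, rng)
print(x.mean(), x.var(), np.corrcoef(x, y)[0, 1])
```

The printed sample mean, variance, and correlation should be close to 0, 1, and 0 respectively, which is a quick sanity check rather than a formal test of normality.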


Once a standard normal number generator is available, simulation from any $n$-dimensional normal distribution $\mathcal{N}(\mu, \Sigma)$ is relatively straightforward. The first step is to find an $n \times n$ matrix $B$ that decomposes $\Sigma$ into the matrix product $BB^\top$. In fact there exist many such decompositions. One of the more important ones is the Cholesky decomposition, which is a special case of the LU decomposition; see Section A.6.1 for more information on such decompositions. In Python, the function cholesky of numpy.linalg can be used to produce such a matrix $B$.

Once the Cholesky factorization is determined, it is easy to simulate $X \sim \mathcal{N}(\mu, \Sigma)$ as, by definition, it is the affine transformation $\mu + BZ$ of an $n$-dimensional standard normal random vector.

Algorithm 3.2.2: Normal Random Vector Simulation
input: $\mu$, $\Sigma$
output: $X \sim \mathcal{N}(\mu, \Sigma)$
1 Determine the Cholesky factorization $\Sigma = BB^\top$.
2 Simulate $Z = [Z_1, \ldots, Z_n]^\top$ by drawing $Z_1, \ldots, Z_n \sim_{\text{iid}} \mathcal{N}(0, 1)$.
3 $X \leftarrow \mu + BZ$
4 return $X$


Example 3.2 (Simulating from a Bivariate Normal Distribution) The Python code below draws $N = 1000$ iid samples from the two bivariate ($n = 2$) normal pdfs in Figure 2.13. The resulting point clouds are given in Figure 3.1.

bvnormal.py

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt

N = 1000
r = 0.0  # correlation coefficient; change to 0.8 for other plot
Sigma = np.array([[1, r], [r, 1]])
B = np.linalg.cholesky(Sigma)  # lower-triangular factor with Sigma = B B^T
x = B @ randn(2, N)            # each column is one draw from N(0, Sigma)
plt.scatter(x[0, :], x[1, :], alpha=0.4, s=4)
plt.show()

Figure 3.1: 1000 realizations of bivariate normal distributions with means zero, variances 1, and correlation coefficients 0 (left) and 0.8 (right).


In some cases, the covariance matrix $\Sigma$ has special structure which can be exploited to create even faster generation algorithms, as illustrated in the following example.

Example 3.3 (Simulating Normal Vectors in $O(n^2)$ Time) Suppose that the random vector $X = [X_1, \ldots, X_n]^\top$ represents the values at times $t_0 + k\delta$, $k = 0, \ldots, n - 1$ of a zero-mean Gaussian process $(X(t), t \ge 0)$ that is weakly stationary, meaning that $\mathrm{Cov}(X(s), X(t))$ depends only on $t - s$. Then clearly the covariance matrix of $X$, say $A_n$, is a symmetric Toeplitz matrix. Suppose for simplicity that $\mathbb{V}\mathrm{ar}\, X(t) = 1$. Then the covariance matrix is in fact a correlation matrix, and will have the following structure:

$$
A_n := \begin{bmatrix}
1 & a_1 & \cdots & a_{n-2} & a_{n-1} \\
a_1 & 1 & \ddots & & a_{n-2} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
a_{n-2} & & \ddots & \ddots & a_1 \\
a_{n-1} & a_{n-2} & \cdots & a_1 & 1
\end{bmatrix}.
$$

Using the Levinson–Durbin algorithm we can compute a lower triangular matrix $L_n$ and a diagonal matrix $D_n$ in $O(n^2)$ time such that $L_n A_n L_n^\top = D_n$; see Theorem A.14. If we simulate $Z_n \sim \mathcal{N}(0, I_n)$, then the solution $X$ of the linear system

$$
L_n X = D_n^{1/2} Z_n
$$

has the desired distribution $\mathcal{N}(0, A_n)$, because its covariance matrix is $L_n^{-1} D_n L_n^{-\top} = A_n$. The linear system is solved in $O(n^2)$ time via forward substitution.
3.2.2.1 Inverse-Transform Method


Let $X$ be a random variable with cumulative distribution function (cdf) $F$. Let $F^{-1}$ denote the inverse¹ of $F$ and $U \sim \mathcal{U}(0, 1)$. Then,

$$
\mathbb{P}[F^{-1}(U) \le x] = \mathbb{P}[U \le F(x)] = F(x). \tag{3.2}
$$

This leads to the following method to simulate a random variable $X$ with cdf $F$:

Algorithm 3.2.3: Inverse-Transform Method
input: Cumulative distribution function $F$.
output: Random variable $X$ distributed according to $F$.
1 Generate $U$ from $\mathcal{U}(0, 1)$.
2 $X \leftarrow F^{-1}(U)$
3 return $X$

The inverse-transform method works both for continuous and discrete distributions. After importing numpy as np, simulating numbers $0, \ldots, k - 1$ according to probabilities $p_0, \ldots, p_{k-1}$ can be done via np.min(np.where(np.cumsum(p) > np.random.rand())), where p is the vector of the probabilities.





Example 3.4 (Example 3.1 (cont.)) One remaining issue in Example 3.1 was how to simulate the radius $R$ when we only know its density $f_R(r) = r\, e^{-r^2/2}$, $r > 0$. We can use the inverse-transform method for this, but first we need to determine its cdf. The cdf of $R$ is, by integration of the pdf,

$$
F_R(r) = 1 - e^{-\frac{1}{2} r^2}, \quad r \ge 0,
$$

and its inverse is found by solving $u = F_R(r)$ in terms of $r$, giving

$$
F_R^{-1}(u) = \sqrt{-2 \ln(1 - u)}, \quad u \in (0, 1).
$$

Thus $R$ has the same distribution as $\sqrt{-2 \ln(1 - U)}$, with $U \sim \mathcal{U}(0, 1)$. Since $1 - U$ also has a $\mathcal{U}(0, 1)$ distribution, $R$ also has the same distribution as $\sqrt{-2 \ln U}$.
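Both the continuous and the discrete case of the inverse-transform method can be sketched in a few lines. The function names, the seed, and the probability vector below are illustrative assumptions; `np.searchsorted` with `side="right"` returns the smallest index whose cumulative probability exceeds the uniform draw, which is a vectorized counterpart of the `np.min(np.where(...))` expression.

```python
import numpy as np

def sample_radius(n, rng):
    """Draw n values of R with cdf F_R(r) = 1 - exp(-r^2 / 2)."""
    u = rng.random(n)                             # uniform on [0, 1)
    return np.sqrt(-2.0 * np.log1p(-u))           # F_R^{-1}(u) = sqrt(-2 ln(1 - u))

def sample_discrete(p, n, rng):
    """Draw n values from {0, ..., k-1} with probabilities p[0], ..., p[k-1]."""
    cdf = np.cumsum(p)
    u = rng.random(n)
    idx = np.searchsorted(cdf, u, side="right")   # smallest i with cdf[i] > u
    return np.minimum(idx, len(p) - 1)            # guard against rounding in cdf[-1]

rng = np.random.default_rng(0)
r = sample_radius(100_000, rng)
k = sample_discrete([0.2, 0.5, 0.3], 100_000, rng)
freq = np.bincount(k, minlength=3) / k.size
print(r.mean(), freq)
```

The sample mean of the radius should be close to $\sqrt{\pi/2}$, the mean of the density $r\, e^{-r^2/2}$, and the observed frequencies should be close to the specified probabilities.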


3.2.2.2 Acceptance—Rejection Method 


The acceptance–rejection method is used to sample from a "difficult" probability density function (pdf) $f(x)$ by generating instead from an "easy" pdf $g(x)$ (for example, via the inverse-transform method) satisfying $f(x) \le C g(x)$ for some constant $C \ge 1$, and then accepting or rejecting the drawn sample with a certain probability. Algorithm 3.2.4 gives the pseudo-code.

The idea of the algorithm is to generate uniformly a point $(X, Y)$ under the graph of the function $Cg$, by first drawing $X \sim g$ and then $Y \sim \mathcal{U}(0, Cg(X))$. If this point lies under the graph of $f$, then we accept $X$ as a sample from $f$; otherwise, we try again. The efficiency of the acceptance–rejection method is usually expressed in terms of the probability of acceptance, which is $1/C$.

¹Every cdf has a unique inverse function defined by $F^{-1}(u) = \inf\{x : F(x) \ge u\}$. If, for each $u$, the equation $F(x) = u$ has a unique solution $x$, this definition coincides with the usual interpretation of the inverse function.
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Algorithm 3.2.4: Acceptance—Rejection Method 
input: Pdf g and constant C such that Cg(x) > f(x) for all x. 
output: Random variable X distributed according to pdf f. 

1 found < false 

2 while not found do 

3 Generate X from g. 


4 Generate U from U(0, 1) independently of X. 
5 Y — UCg(X) 

6 if Y < f(X) then found < true 

7 return X 


Example 3.5 (Simulating Gamma Random Variables) Simulating random variables from a $\mathsf{Gamma}(\alpha, \lambda)$ distribution is generally done via the acceptance–rejection method. Consider, for example, the Gamma distribution with $\alpha = 1.3$ and $\lambda = 5.6$. Its pdf,

$$f(x) = \frac{\lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}, \quad x \geq 0,$$

where $\Gamma$ is the gamma function $\Gamma(\alpha) := \int_0^\infty e^{-x} x^{\alpha-1}\,\mathrm{d}x$, $\alpha > 0$, is depicted by the blue solid curve in Figure 3.2.

Figure 3.2: The pdf $g$ of the Exp(4) distribution multiplied by $C = 1.2$ dominates the pdf $f$ of the Gamma(1.3, 5.6) distribution.


This pdf happens to lie completely under the graph of $Cg(x)$, where $C = 1.2$ and $g(x) = 4\exp(-4x)$, $x \geq 0$, is the pdf of the exponential distribution Exp(4). Hence, we can simulate from this particular Gamma distribution by accepting or rejecting a sample from the Exp(4) distribution according to Step 6 of Algorithm 3.2.4. Simulating from the Exp(4) distribution can be done via the inverse-transform method: simulate $U \sim \mathsf{U}(0,1)$ and return $X = -\ln(U)/4$. The following Python code implements Algorithm 3.2.4 for this example.

accrejgamma.py
from math import exp, gamma, log
from numpy.random import rand

alpha = 1.3   # shape parameter of the Gamma target
lam = 5.6     # rate parameter of the Gamma target
f = lambda x: lam**alpha * x**(alpha-1) * exp(-lam*x)/gamma(alpha)  # target pdf
g = lambda x: 4*exp(-4*x)   # pdf of the Exp(4) proposal
C = 1.2
found = False
while not found:
    x = -log(rand())/4      # Exp(4) draw via the inverse-transform method
    if C*g(x)*rand() <= f(x):
        found = True
print(x)

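As a hedged check of the claim that the acceptance probability equals $1/C$, the sketch below vectorizes the same scheme for the Gamma(1.3, 5.6) target and the Exp(4) proposal. The number of proposals M and the seed are illustrative assumptions.

```python
import numpy as np
from math import gamma

alpha, lam = 1.3, 5.6        # Gamma(alpha, lam) target from Example 3.5
rate, C = 4.0, 1.2           # Exp(4) proposal and bounding constant from the text

def f(x):                    # target pdf
    return lam**alpha * x**(alpha - 1) * np.exp(-lam * x) / gamma(alpha)

def g(x):                    # proposal pdf
    return rate * np.exp(-rate * x)

rng = np.random.default_rng(2)             # arbitrary seed (assumption)
M = 200_000                                # number of proposals (hypothetical choice)
X = -np.log(1.0 - rng.random(M)) / rate    # Exp(4) draws via inverse transform
U = rng.random(M)
accept = U * C * g(X) <= f(X)              # Step 6 of Algorithm 3.2.4, applied to all proposals
samples = X[accept]

print("acceptance rate:", accept.mean(), "versus 1/C =", 1 / C)
print("sample mean:", samples.mean(), "versus alpha/lam =", alpha / lam)
```

The accepted values form an iid sample from the Gamma(1.3, 5.6) distribution, so their mean should be close to $\alpha/\lambda$.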
3.2.3 Simulating Random Vectors and Processes

Techniques for generating random vectors and processes are as diverse as the class of random processes themselves; see, for example, [71]. We highlight a few general scenarios.

When $X_1, \ldots, X_n$ are independent random variables with pdfs $f_i$, $i = 1, \ldots, n$, so that their joint pdf is $f(\mathbf{x}) = f_1(x_1) \cdots f_n(x_n)$, the random vector $\mathbf{X} = [X_1, \ldots, X_n]^\top$ can be simply simulated by drawing each component $X_i \sim f_i$ individually, for example via the inverse-transform method or acceptance–rejection.

For dependent components $X_1, \ldots, X_n$, we can, as a consequence of the product rule of probability, represent the joint pdf $f(\mathbf{x})$ as

$$f(\mathbf{x}) = f(x_1, \ldots, x_n) = f_1(x_1)\, f_2(x_2 \mid x_1) \cdots f_n(x_n \mid x_1, \ldots, x_{n-1}), \qquad (3.3)$$

where $f_1(x_1)$ is the marginal pdf of $X_1$ and $f_k(x_k \mid x_1, \ldots, x_{k-1})$ is the conditional pdf of $X_k$ given $X_1 = x_1, X_2 = x_2, \ldots, X_{k-1} = x_{k-1}$. Provided the conditional pdfs are known, one can generate $\mathbf{X}$ by first generating $X_1$, then, given $X_1 = x_1$, generate $X_2$ from $f_2(x_2 \mid x_1)$, and so on, until generating $X_n$ from $f_n(x_n \mid x_1, \ldots, x_{n-1})$.
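To illustrate the sequential scheme based on (3.3), consider a hypothetical two-dimensional example (not from the text): a bivariate normal vector with standard normal marginals and correlation $\varrho$, for which $X_1 \sim \mathcal{N}(0,1)$ and $(X_2 \mid X_1 = x_1) \sim \mathcal{N}(\varrho x_1, 1 - \varrho^2)$. The value of rho, the seed, and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)     # arbitrary seed (assumption)
rho, N = 0.8, 100_000              # hypothetical correlation and sample size

x1 = rng.standard_normal(N)                                    # X1 ~ f1 = N(0, 1)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(N)   # X2 | X1 = x1 ~ N(rho*x1, 1 - rho^2)

corr = np.corrcoef(x1, x2)[0, 1]
print(corr)                        # expected to be close to rho
```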
The latter method is particularly applicable for generating Markov chains. Recall from Section C.10 that a Markov chain is a stochastic process $\{X_t, t = 0, 1, 2, \ldots\}$ that satisfies the Markov property; meaning that for all $t$ and $s$ the conditional distribution of $X_{t+s}$ given $X_u$, $u \leq t$, is the same as that of $X_{t+s}$ given only $X_t$. As a result, each conditional density $f_t(x_t \mid x_1, \ldots, x_{t-1})$ can be written as a one-step transition density $q_t(x_t \mid x_{t-1})$; that is, the probability density to go from state $x_{t-1}$ to state $x_t$ in one step. In many cases of interest the chain is time-homogeneous, meaning that the transition density $q_t$ does not depend on $t$. Such Markov chains can be generated sequentially, as given in Algorithm 3.2.5.

Algorithm 3.2.5: Simulate a Markov Chain
input: Number of steps $N$, initial pdf $f_0$, transition density $q$.
1 Draw $X_0$ from the initial pdf $f_0$.
2 for $t = 1$ to $N$ do
3     Draw $X_t$ from the distribution corresponding to the density $q(\cdot \mid X_{t-1})$
4 return $X_0, \ldots, X_N$


Example 3.6 (Markov Chain Simulation) For time-homogeneous Markov chains with a discrete state space, we can visualize the one-step transitions by means of a transition graph, where arrows indicate possible transitions between states and the labels describe the corresponding probabilities. Figure 3.3 shows (on the left) the transition graph of the Markov chain $\{X_t, t = 0, 1, 2, \ldots\}$ with state space $\{1, 2, 3, 4\}$ and one-step transition matrix

$$P = \begin{bmatrix} 0 & 0.2 & 0.5 & 0.3 \\ 0.5 & 0 & 0.5 & 0 \\ 0.3 & 0.7 & 0 & 0 \\ 0.1 & 0 & 0 & 0.9 \end{bmatrix}.$$

Figure 3.3: The transition graph (left) and a typical path (right) of the Markov chain.

In the same figure (on the right) a typical outcome (path) of the Markov chain is shown. The path was simulated using the Python program below. In this implementation the Markov chain always starts in state 1. We will revisit Markov chains, and in particular Markov chains with continuous state spaces, in Section 3.2.5.










import numpy as np
import matplotlib.pyplot as plt

n = 101   # number of time points t = 0, ..., 100
P = np.array([[0, 0.2, 0.5, 0.3],
              [0.5, 0, 0.5, 0],
              [0.3, 0.7, 0, 0],
              [0.1, 0, 0, 0.9]])
x = np.array(np.ones(n, dtype=int))
x[0] = 0   # state 1 corresponds to index 0
for t in range(0,n-1):
    # inverse-transform draw from row x[t] of P
    x[t+1] = np.min(np.where(np.cumsum(P[x[t],:]) >
                    np.random.rand()))
x = x + 1  # add 1 to all elements of the vector x
plt.plot(np.array(range(0,n)),x, 'o')
plt.plot(np.array(range(0,n)),x, '--')
plt.show()
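As a possible sanity check on such a simulation, the sketch below runs a longer path of the same chain and compares the fraction of time spent in each state with the stationary distribution $\boldsymbol{\pi}$ solving $\boldsymbol{\pi} P = \boldsymbol{\pi}$. The helper name simulate_chain, the path length, and the seed are assumptions; the comparison with the stationary distribution is an addition for illustration and is not part of the example above.

```python
import numpy as np

P = np.array([[0, 0.2, 0.5, 0.3],
              [0.5, 0, 0.5, 0],
              [0.3, 0.7, 0, 0],
              [0.1, 0, 0, 0.9]])

def simulate_chain(P, x0, n, rng):
    """Algorithm 3.2.5 for a finite state space; states are indexed 0, ..., k-1."""
    x = np.empty(n, dtype=int)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = rng.choice(len(P), p=P[x[t]])   # draw from row x[t] of P
    return x

rng = np.random.default_rng(9)                 # arbitrary seed (assumption)
x = simulate_chain(P, 0, 50_000, rng)          # start in state 1 (index 0)
freq = np.bincount(x, minlength=4) / len(x)    # fraction of time spent in each state

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(freq)
print(pi)
```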





3.2.4 Resampling 


The idea behind resampling is very simple: an iid sample $\tau := \{x_1, \ldots, x_n\}$ from some unknown cdf $F$ represents our best knowledge of $F$ if we make no further a priori assumptions about it. If it is not possible to simulate more samples from $F$, the best way to “repeat” the experiment is to resample from the original data by drawing from the empirical cdf $F_n$; see (1.2). That is, we draw each $x_i$ with equal probability and repeat this $N$ times, according to Algorithm 3.2.6 below. As we draw here “with replacement”, multiple instances of the original data points may occur in the resampled data.

Algorithm 3.2.6: Sampling from an Empirical Cdf
input: Original iid sample $x_1, \ldots, x_n$ and sample size $N$.
output: Iid sample $X_1^*, \ldots, X_N^*$ from the empirical cdf.
1 for $t = 1$ to $N$ do
2     Draw $U \sim \mathsf{U}(0,1)$
3     Set $I \leftarrow \lceil nU \rceil$
4     Set $X_t^* \leftarrow x_I$
5 return $X_1^*, \ldots, X_N^*$

In Step 3, $\lceil nU \rceil$ returns the ceiling of $nU$; that is, it is the smallest integer larger than or equal to $nU$. Consequently, $I$ is drawn uniformly at random from the set of indices $\{1, \ldots, n\}$.
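A direct transcription of Algorithm 3.2.6 might look as follows. The data vector, the seed, and the function name resample are hypothetical; in practice numpy's choice function with replace=True achieves the same result.

```python
import numpy as np

def resample(x, N, rng):
    """Algorithm 3.2.6: draw N points with replacement from the data x."""
    n = len(x)
    U = 1.0 - rng.random(N)            # values in (0, 1], so ceil(n*U) lies in {1, ..., n}
    I = np.ceil(n * U).astype(int)     # Step 3: uniform index in {1, ..., n}
    return x[I - 1]                    # Step 4, shifted to 0-based indexing

rng = np.random.default_rng(4)         # arbitrary seed (assumption)
x = np.array([2.0, 3.5, 5.0, 7.5])     # hypothetical data
xs = resample(x, 10, rng)
print(xs)                              # ten values, each one of the original data points
```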

By sampling from the empirical cdf we can thus (approximately) repeat the experiment that gave us the original data as many times as we like. This is useful if we want to assess the properties of certain statistics obtained from the data. For example, suppose that the original data $\tau$ gave the statistic $t(\tau)$. By resampling we can gain information about the distribution of the corresponding random variable $t(\mathcal{T})$.


Example 3.7 (Quotient of Uniforms) Let $U_1, \ldots, U_n, V_1, \ldots, V_n$ be iid $\mathsf{U}(0,1)$ random variables and define $X_i = U_i/V_i$, $i = 1, \ldots, n$. Suppose we wish to investigate the distribution of the sample median $\tilde{X}$ and sample mean $\bar{X}$ of the (random) data $\mathcal{T} := \{X_1, \ldots, X_n\}$. Since we know the model for $\mathcal{T}$ exactly, we can generate a large number, $N$ say, of independent copies of it, and for each of these copies evaluate the sample medians $\tilde{X}_1, \ldots, \tilde{X}_N$ and sample means $\bar{X}_1, \ldots, \bar{X}_N$. For $n = 100$ and $N = 1000$ the empirical cdfs might look like the left and right curves in Figure 3.4, respectively. Contrary to what you might have expected, the distributions of the sample median and sample mean do not match at all. The sample median is quite concentrated around 1, whereas the distribution of the sample mean is much more spread out.

Figure 3.4: Empirical cdfs of the medians of the resampled data (left curve) and sample means (right curve) of the resampled data.


Instead of sampling completely new data, we could also reuse the original data by resampling them via Algorithm 3.2.6. This gives independent copies $\tilde{X}_1^*, \ldots, \tilde{X}_N^*$ and $\bar{X}_1^*, \ldots, \bar{X}_N^*$, for which we can again plot the empirical cdf. The results will be similar to the previous case. In fact, in Figure 3.4 the cdfs of the resampled sample medians and sample means are plotted. The corresponding Python code is given below. The essential point of this example is that resampling of data can greatly add to the understanding of the probabilistic properties of certain measurements on the data, even if the underlying model is not known. See Exercise 12 for a further investigation of this example.


quotunif.py
import numpy as np
from numpy.random import rand, choice
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF

n = 100   # size of the original data set
N = 1000  # number of resampled data sets
x = rand(n)/rand(n)  # data
med = np.zeros(N)
ave = np.zeros(N)
for i in range(0,N):
    s = choice(x, n, replace=True)  # resampled data
    med[i] = np.median(s)
    ave[i] = np.mean(s)

med_cdf = ECDF(med)
ave_cdf = ECDF(ave)
plt.plot(med_cdf.x, med_cdf.y)
plt.plot(ave_cdf.x, ave_cdf.y)
plt.show()

3.2.5 Markov Chain Monte Carlo


Markov chain Monte Carlo (MCMC) is a Monte Carlo sampling technique for (approximately) generating samples from an arbitrary distribution, often referred to as the target distribution. The basic idea is to construct a Markov chain whose limiting distribution is the target distribution, and to run it long enough that the distribution of its state is close to this limit. Often such a Markov chain is constructed to be reversible, so that the detailed balance equations (C.43) can be used. Depending on the starting position of the Markov chain, the initial random variables in the Markov chain may have a distribution that is significantly different from the target (limiting) distribution. The random variables that are generated during this burn-in period are often discarded. The remaining random variables form an approximate and dependent sample from the target distribution.

In the next two sections we discuss two popular MCMC samplers: the Metropolis–Hastings sampler and the Gibbs sampler.


3.2.5.1 Metropolis–Hastings Sampler

The Metropolis–Hastings sampler [87] is similar to the acceptance–rejection method in that it simulates a trial state, which is then accepted or rejected according to some random mechanism. Specifically, suppose we wish to sample from a target pdf $f(\mathbf{x})$, where $\mathbf{x}$ takes values in some $d$-dimensional set. The aim is to construct a Markov chain $\{\mathbf{X}_t, t = 0, 1, \ldots\}$ in such a way that its limiting pdf is $f$. Suppose the Markov chain is in state $\mathbf{x}$ at time $t$. A transition of the Markov chain from state $\mathbf{x}$ is carried out in two phases. First a proposal state $\mathbf{Y}$ is drawn from a transition density $q(\cdot \mid \mathbf{x})$. This state is accepted as the new state, with acceptance probability

$$\alpha(\mathbf{x}, \mathbf{y}) = \min\left\{ \frac{f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y})}{f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x})},\ 1 \right\}, \qquad (3.4)$$

or rejected otherwise. In the latter case the chain remains in state $\mathbf{x}$. The algorithm just described can be summarized as follows.

Algorithm 3.2.7: Metropolis–Hastings Sampler
input: Initial state $\mathbf{X}_0$, sample size $N$, target pdf $f(\mathbf{x})$, proposal function $q(\mathbf{y} \mid \mathbf{x})$.
output: $\mathbf{X}_1, \ldots, \mathbf{X}_N$ (dependent), approximately distributed according to $f(\mathbf{x})$.
1 for $t = 0$ to $N - 1$ do
2     Draw $\mathbf{Y} \sim q(\mathbf{y} \mid \mathbf{X}_t)$   // draw a proposal
3     $\alpha \leftarrow \alpha(\mathbf{X}_t, \mathbf{Y})$   // acceptance probability as in (3.4)
4     Draw $U \sim \mathsf{U}(0,1)$
5     if $U \leq \alpha$ then $\mathbf{X}_{t+1} \leftarrow \mathbf{Y}$
6     else $\mathbf{X}_{t+1} \leftarrow \mathbf{X}_t$
7 return $\mathbf{X}_1, \ldots, \mathbf{X}_N$


The fact that the limiting distribution of the Metropolis–Hastings Markov chain is equal to the target distribution (under general conditions) is a consequence of the following result.

Theorem 3.1: Local Balance for the Metropolis–Hastings Sampler
The transition density of the Metropolis–Hastings Markov chain satisfies the detailed balance equations.
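Proof: We prove the theorem for the discrete case only. Because a transition of the Metropolis–Hastings Markov chain consists of two steps, the one-step transition probability to go from $\mathbf{x}$ to $\mathbf{y}$ is not $q(\mathbf{y} \mid \mathbf{x})$ but

$$\tilde{q}(\mathbf{y} \mid \mathbf{x}) = \begin{cases} q(\mathbf{y} \mid \mathbf{x})\, \alpha(\mathbf{x}, \mathbf{y}), & \text{if } \mathbf{y} \neq \mathbf{x}, \\ 1 - \sum_{\mathbf{z} \neq \mathbf{x}} q(\mathbf{z} \mid \mathbf{x})\, \alpha(\mathbf{x}, \mathbf{z}), & \text{if } \mathbf{y} = \mathbf{x}. \end{cases} \qquad (3.5)$$

We thus need to show that

$$f(\mathbf{x})\, \tilde{q}(\mathbf{y} \mid \mathbf{x}) = f(\mathbf{y})\, \tilde{q}(\mathbf{x} \mid \mathbf{y}) \quad \text{for all } \mathbf{x}, \mathbf{y}. \qquad (3.6)$$

With the acceptance probability as in (3.4), we need to check (3.6) for three cases:

(a) $\mathbf{x} = \mathbf{y}$,
(b) $\mathbf{x} \neq \mathbf{y}$ and $f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y}) \leq f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x})$, and
(c) $\mathbf{x} \neq \mathbf{y}$ and $f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y}) > f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x})$.

Case (a) holds trivially. For case (b), $\alpha(\mathbf{x}, \mathbf{y}) = f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y}) / (f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x}))$ and $\alpha(\mathbf{y}, \mathbf{x}) = 1$. Consequently,

$$\tilde{q}(\mathbf{y} \mid \mathbf{x}) = f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y}) / f(\mathbf{x}) \quad \text{and} \quad \tilde{q}(\mathbf{x} \mid \mathbf{y}) = q(\mathbf{x} \mid \mathbf{y}),$$

so that (3.6) holds. Similarly, for case (c) we have $\alpha(\mathbf{x}, \mathbf{y}) = 1$ and $\alpha(\mathbf{y}, \mathbf{x}) = f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x}) / (f(\mathbf{y})\, q(\mathbf{x} \mid \mathbf{y}))$. It follows that

$$\tilde{q}(\mathbf{y} \mid \mathbf{x}) = q(\mathbf{y} \mid \mathbf{x}) \quad \text{and} \quad \tilde{q}(\mathbf{x} \mid \mathbf{y}) = f(\mathbf{x})\, q(\mathbf{y} \mid \mathbf{x}) / f(\mathbf{y}),$$

so that (3.6) holds again. □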


Thus if the Metropolis–Hastings Markov chain is ergodic, then its limiting pdf is $f(\mathbf{x})$. A fortunate property of the algorithm, which is important in many applications, is that in order to evaluate the acceptance probability $\alpha(\mathbf{x}, \mathbf{y})$ in (3.4), one only needs to know the target pdf $f(\mathbf{x})$ up to a constant; that is, $f(\mathbf{x}) = c\, \bar{f}(\mathbf{x})$ for some known function $\bar{f}(\mathbf{x})$ but unknown constant $c$.

The efficiency of the algorithm depends of course on the choice of the proposal transition density $q(\mathbf{y} \mid \mathbf{x})$. Ideally, we would like $q(\mathbf{y} \mid \mathbf{x})$ to be “close” to the target $f(\mathbf{y})$, irrespective of $\mathbf{x}$. We discuss two common approaches.

1. Choose the proposal transition density $q(\mathbf{y} \mid \mathbf{x})$ independent of $\mathbf{x}$; that is, $q(\mathbf{y} \mid \mathbf{x}) = g(\mathbf{y})$ for some pdf $g(\mathbf{y})$. An MCMC sampler of this type is called an independence sampler. The acceptance probability is thus

$$\alpha(\mathbf{x}, \mathbf{y}) = \min\left\{ \frac{f(\mathbf{y})\, g(\mathbf{x})}{f(\mathbf{x})\, g(\mathbf{y})},\ 1 \right\}.$$

2. If the proposal transition density is symmetric (that is, $q(\mathbf{y} \mid \mathbf{x}) = q(\mathbf{x} \mid \mathbf{y})$), then the acceptance probability has the simple form

$$\alpha(\mathbf{x}, \mathbf{y}) = \min\left\{ \frac{f(\mathbf{y})}{f(\mathbf{x})},\ 1 \right\}, \qquad (3.7)$$

and the MCMC algorithm is called a random walk sampler. A typical example is when, for a given current state $\mathbf{x}$, the proposal state $\mathbf{Y}$ is of the form $\mathbf{Y} = \mathbf{x} + \mathbf{Z}$, where $\mathbf{Z}$ is generated from some spherically symmetric distribution, such as $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
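The first approach can be sketched with the Gamma(1.3, 5.6) target of Example 3.5 and an Exp(4) proposal. This is only an illustration under stated assumptions: the target is deliberately left unnormalized (the function fbar below), and the chain length, starting value, and seed are arbitrary choices.

```python
import numpy as np

a, lam = 1.3, 5.6                        # Gamma(1.3, 5.6) target from Example 3.5

def fbar(x):                             # target pdf without its normalization constant
    return x**(a - 1) * np.exp(-lam * x)

def g(x):                                # proposal pdf: Exp(4)
    return 4.0 * np.exp(-4.0 * x)

rng = np.random.default_rng(5)           # arbitrary seed (assumption)
N = 50_000                               # hypothetical chain length
x = 0.25                                 # hypothetical starting state
chain = np.empty(N)
for t in range(N):
    y = rng.exponential(scale=1 / 4.0)                    # proposal drawn independently of x
    alpha = min(fbar(y) * g(x) / (fbar(x) * g(y)), 1.0)   # the unknown constant cancels in the ratio
    if rng.random() <= alpha:
        x = y
    chain[t] = x

print(chain.mean(), a / lam)             # sample mean versus the Gamma mean a/lam
```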


We now give an example illustrating the second approach. 


Example 3.8 (Random Walk Sampler) Consider the two-dimensional pdf

$$f(x_1, x_2) = c\, e^{-\frac{1}{4}\sqrt{x_1^2 + x_2^2}} \left( \sin\left( 2\sqrt{x_1^2 + x_2^2} \right) + 1 \right), \quad -2\pi < x_1 < 2\pi,\ -2\pi < x_2 < 2\pi, \qquad (3.8)$$

where $c$ is an unknown normalization constant. The graph of this pdf (unnormalized) is depicted in the left panel of Figure 3.5.

Figure 3.5: Left panel: the two-dimensional target pdf. Right panel: points from the random walk sampler are approximately distributed according to the target pdf.

The following Python program implements a random walk sampler to (approximately) draw $N = 10^4$ dependent samples from the pdf $f$. At each step, given a current state $\mathbf{x}$, a proposal $\mathbf{Y}$ is drawn from the $\mathcal{N}(\mathbf{x}, \mathbf{I})$ distribution. That is, $\mathbf{Y} = \mathbf{x} + \mathbf{Z}$, with $\mathbf{Z}$ bivariate standard normal. We see in the right panel of Figure 3.5 that the sampler works correctly. The starting point for the Markov chain is chosen as $(0, 0)$. Note that the normalization constant $c$ is never required to be specified in the program.










import numpy as np
import matplotlib.pyplot as plt
from numpy import pi, exp, sqrt, sin
from numpy.random import rand, randn

N = 10000
a = lambda x: -2*pi < x
b = lambda x: x < 2*pi
f = lambda x1, x2: exp(-sqrt(x1**2+x2**2)/4)*(
        sin(2*sqrt(x1**2+x2**2))+1)*a(x1)*b(x1)*a(x2)*b(x2)

xx = np.zeros((N,2))
x = np.zeros((1,2))
for i in range(1,N):
    y = x + randn(1,2)
    alpha = np.amin((f(y[0][0],y[0][1])/f(x[0][0],x[0][1]),1))
    r = rand() < alpha
    x = r*y + (1-r)*x
    xx[i,:] = x

plt.scatter(xx[:,0], xx[:,1], alpha=0.4, s=2)
plt.axis('equal')
plt.show()





3.2.5.2 Gibbs Sampler

The Gibbs sampler [48] uses a somewhat different methodology from the Metropolis–Hastings algorithm and is particularly useful for generating $n$-dimensional random vectors. The key idea of the Gibbs sampler is to update the components of the random vector one at a time, by sampling them from conditional pdfs. Thus, Gibbs sampling can be advantageous if it is easier to sample from the conditional distributions than from the joint distribution.

Specifically, suppose that we wish to sample a random vector $\mathbf{X} = [X_1, \ldots, X_n]^\top$ according to a target pdf $f(\mathbf{x})$. Let $f(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ represent the conditional pdf² of the $i$-th component, $X_i$, given the other components $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$. The Gibbs sampling algorithm is as follows.

Algorithm 3.2.8: Gibbs Sampler
input: Initial point $\mathbf{X}_0$, sample size $N$, and target pdf $f$.
output: $\mathbf{X}_1, \ldots, \mathbf{X}_N$ approximately distributed according to $f$.
1 for $t = 0$ to $N - 1$ do
2     Draw $Y_1$ from the conditional pdf $f(y_1 \mid X_{t,2}, \ldots, X_{t,n})$.
3     for $i = 2$ to $n$ do
4         Draw $Y_i$ from the conditional pdf $f(y_i \mid Y_1, \ldots, Y_{i-1}, X_{t,i+1}, \ldots, X_{t,n})$.
5     $\mathbf{X}_{t+1} \leftarrow \mathbf{Y}$
6 return $\mathbf{X}_1, \ldots, \mathbf{X}_N$
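A minimal sketch of the systematic Gibbs sampler for $n = 2$ is given below. The target is a hypothetical one chosen for illustration: a bivariate normal distribution with standard normal marginals and correlation $\varrho$, whose conditional distributions are $\mathcal{N}(\varrho x_2, 1-\varrho^2)$ and $\mathcal{N}(\varrho x_1, 1-\varrho^2)$. The value of rho, the chain length, and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)     # arbitrary seed (assumption)
rho, N = 0.8, 50_000               # hypothetical correlation and chain length
s = np.sqrt(1 - rho**2)            # conditional standard deviation

x = np.zeros(2)                    # initial point X_0
X = np.empty((N, 2))
for t in range(N):
    x[0] = rho * x[1] + s * rng.standard_normal()   # draw Y_1 given X_{t,2}
    x[1] = rho * x[0] + s * rng.standard_normal()   # draw Y_2 given Y_1
    X[t] = x                                        # X_{t+1} <- Y

corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(corr)                        # expected to be close to rho
```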
There exist many variants of the Gibbs sampler, depending on the steps required to update $\mathbf{X}_t$ to $\mathbf{X}_{t+1}$, called the cycle of the Gibbs algorithm. In the algorithm above, the cycle consists of Steps 2–5, in which the components are updated in a fixed order $1 \to 2 \to \cdots \to n$. For this reason Algorithm 3.2.8 is also called the systematic Gibbs sampler.

² In this section we employ a Bayesian notation style, using the same letter $f$ for different (conditional) densities.

In the random-order Gibbs sampler, the order in which the components are updated in each cycle is a random permutation of $\{1, \ldots, n\}$ (see Exercise 9). Other modifications are to update the components in blocks (i.e., several at the same time), or to update only a random selection of components. The variant where in each cycle only a single random component is updated is called the random Gibbs sampler. In the reversible Gibbs sampler a cycle consists of the coordinate-wise updating $1 \to 2 \to \cdots \to n-1 \to n \to n-1 \to \cdots \to 2 \to 1$. In all cases, except for the systematic Gibbs sampler, the resulting Markov chain $\{\mathbf{X}_t, t = 1, 2, \ldots\}$ is reversible and hence its limiting distribution is precisely $f(\mathbf{x})$.

Unfortunately, the systematic Gibbs Markov chain is not reversible and so the detailed balance equations are not satisfied. However, a similar result holds, due to Hammersley and Clifford, under the so-called positivity condition: if at a point $\mathbf{x} = (x_1, \ldots, x_n)$ all marginal densities $f(x_i) > 0$, $i = 1, \ldots, n$, then the joint density $f(\mathbf{x}) > 0$.


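Theorem 3.2: Hammersley–Clifford Balance for the Gibbs Sampler
Let $q_{1 \to n}(\mathbf{y} \mid \mathbf{x})$ denote the transition density of the systematic Gibbs sampler, and let $q_{n \to 1}(\mathbf{x} \mid \mathbf{y})$ be the transition density of the reverse move, in which the vector $\mathbf{y}$ is updated in the order $n \to n-1 \to \cdots \to 1$ to $\mathbf{x}$. Then, if the positivity condition holds,

$$f(\mathbf{x})\, q_{1 \to n}(\mathbf{y} \mid \mathbf{x}) = f(\mathbf{y})\, q_{n \to 1}(\mathbf{x} \mid \mathbf{y}). \qquad (3.9)$$

Proof: For the forward move we have:

$$q_{1 \to n}(\mathbf{y} \mid \mathbf{x}) = f(y_1 \mid x_2, \ldots, x_n)\, f(y_2 \mid y_1, x_3, \ldots, x_n) \cdots f(y_n \mid y_1, \ldots, y_{n-1}),$$

and for the reverse move:

$$q_{n \to 1}(\mathbf{x} \mid \mathbf{y}) = f(x_n \mid y_1, \ldots, y_{n-1})\, f(x_{n-1} \mid y_1, \ldots, y_{n-2}, x_n) \cdots f(x_1 \mid x_2, \ldots, x_n).$$

Consequently,

$$\begin{aligned}
\frac{q_{1 \to n}(\mathbf{y} \mid \mathbf{x})}{q_{n \to 1}(\mathbf{x} \mid \mathbf{y})}
&= \prod_{i=1}^{n} \frac{f(y_i \mid y_1, \ldots, y_{i-1}, x_{i+1}, \ldots, x_n)}{f(x_i \mid y_1, \ldots, y_{i-1}, x_{i+1}, \ldots, x_n)} \\
&= \prod_{i=1}^{n} \frac{f(y_1, \ldots, y_i, x_{i+1}, \ldots, x_n)}{f(y_1, \ldots, y_{i-1}, x_i, \ldots, x_n)} \\
&= \frac{f(\mathbf{y}) \prod_{i=1}^{n-1} f(y_1, \ldots, y_i, x_{i+1}, \ldots, x_n)}{f(\mathbf{x}) \prod_{j=2}^{n} f(y_1, \ldots, y_{j-1}, x_j, \ldots, x_n)} \\
&= \frac{f(\mathbf{y}) \prod_{i=1}^{n-1} f(y_1, \ldots, y_i, x_{i+1}, \ldots, x_n)}{f(\mathbf{x}) \prod_{j=1}^{n-1} f(y_1, \ldots, y_j, x_{j+1}, \ldots, x_n)}
= \frac{f(\mathbf{y})}{f(\mathbf{x})}.
\end{aligned}$$

The result follows by rearranging the last identity. The positivity condition ensures that we do not divide by 0 along the line. □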


Intuitively, the long-run proportion of transitions $\mathbf{x} \to \mathbf{y}$ for the “forward move” chain is equal to the long-run proportion of transitions $\mathbf{y} \to \mathbf{x}$ for the “reverse move” chain.

To verify that the Markov chain $\mathbf{X}_0, \mathbf{X}_1, \ldots$ for the systematic Gibbs sampler indeed has limiting pdf $f(\mathbf{x})$, we need to check that the global balance equations (C.42) hold. By integrating (in the continuous case) both sides in (3.9) with respect to $\mathbf{x}$, we see that indeed

$$\int f(\mathbf{x})\, q_{1 \to n}(\mathbf{y} \mid \mathbf{x})\, \mathrm{d}\mathbf{x} = f(\mathbf{y}).$$


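Example 3.9 (Gibbs Sampler for the Bayesian Normal Model) Gibbs samplers are often applied in Bayesian statistics, to sample from the posterior pdf. Consider for instance the Bayesian normal model

$$f(\mu, \sigma^2) = 1/\sigma^2$$
$$(\mathbf{x} \mid \mu, \sigma^2) \sim \mathcal{N}(\mu \mathbf{1}, \sigma^2 \mathbf{I}).$$

Here the prior for $(\mu, \sigma^2)$ is improper. That is, it is not a pdf in itself, but by obstinately applying Bayes’ formula it does yield a proper posterior pdf. In some sense this prior conveys the least amount of information about $\mu$ and $\sigma^2$. Following the same procedure as in Example 2.8, we find the posterior pdf:

$$f(\mu, \sigma^2 \mid \mathbf{x}) \propto (\sigma^2)^{-n/2 - 1} \exp\left\{ -\frac{1}{2} \frac{\sum_i (x_i - \mu)^2}{\sigma^2} \right\}. \qquad (3.10)$$

Note that $\mu$ and $\sigma^2$ here are the “variables” and $\mathbf{x}$ is a fixed data vector. To simulate samples $\mu$ and $\sigma^2$ from (3.10) using the Gibbs sampler, we need the distributions of both $(\mu \mid \sigma^2, \mathbf{x})$ and $(\sigma^2 \mid \mu, \mathbf{x})$. To find $f(\mu \mid \sigma^2, \mathbf{x})$, view the right-hand side of (3.10) as a function of $\mu$ only, regarding $\sigma^2$ as a constant. This gives

$$f(\mu \mid \sigma^2, \mathbf{x}) \propto \exp\left\{ -\frac{n\mu^2 - 2\mu \sum_i x_i}{2\sigma^2} \right\} = \exp\left\{ -\frac{\mu^2 - 2\mu \bar{x}}{2\sigma^2/n} \right\} \propto \exp\left\{ -\frac{1}{2} \frac{(\mu - \bar{x})^2}{\sigma^2/n} \right\}. \qquad (3.11)$$

This shows that $(\mu \mid \sigma^2, \mathbf{x})$ has a normal distribution with mean $\bar{x}$ and variance $\sigma^2/n$. Similarly, to find $f(\sigma^2 \mid \mu, \mathbf{x})$, view the right-hand side of (3.10) as a function of $\sigma^2$, regarding $\mu$ as a constant. This gives

$$f(\sigma^2 \mid \mu, \mathbf{x}) \propto (\sigma^2)^{-n/2 - 1} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 / \sigma^2 \right\}, \qquad (3.12)$$

showing that $(\sigma^2 \mid \mu, \mathbf{x})$ has an inverse-gamma distribution with parameters $n/2$ and $\sum_{i=1}^{n} (x_i - \mu)^2/2$. The Gibbs sampler thus involves the repeated simulation of

$$(\mu \mid \sigma^2, \mathbf{x}) \sim \mathcal{N}(\bar{x}, \sigma^2/n) \quad \text{and} \quad (\sigma^2 \mid \mu, \mathbf{x}) \sim \mathsf{InvGamma}\left( n/2,\ \sum_{i=1}^{n} (x_i - \mu)^2/2 \right).$$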


Simulating $X \sim \mathsf{InvGamma}(\alpha, \lambda)$ is achieved by first generating $Z \sim \mathsf{Gamma}(\alpha, \lambda)$ and then returning $X = 1/Z$.

In our parameterization of the $\mathsf{Gamma}(\alpha, \lambda)$ distribution, $\lambda$ is the rate parameter. Many software packages instead use the scale parameter $c = 1/\lambda$. Be aware of this when simulating Gamma random variables.
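For instance, NumPy's gamma generator takes a shape and a scale argument, so a rate $\lambda$ has to be passed as scale $1/\lambda$. The sketch below uses hypothetical parameter values and checks the result against the known mean $\lambda/(\alpha - 1)$ of the inverse-gamma distribution (valid for $\alpha > 1$).

```python
import numpy as np

rng = np.random.default_rng(12)          # arbitrary seed (assumption)
a, lam = 5.0, 3.0                        # hypothetical shape and rate parameters
Z = rng.gamma(shape=a, scale=1.0 / lam, size=200_000)   # Gamma(a, lam), with lam a rate
X = 1.0 / Z                              # InvGamma(a, lam)

print(X.mean(), lam / (a - 1))           # sample mean versus the theoretical mean
```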





The Python script below defines a small data set of size $n = 10$ (which was randomly simulated from a standard normal distribution), and implements the systematic Gibbs sampler to simulate from the posterior distribution, using $N = 10^5$ samples.

gibbsamp.py
import numpy as np
import matplotlib.pyplot as plt

x = np.array([[-0.9472, 0.5401, -0.2166, 1.1890, 1.3170,
               -0.4056, -0.4449, 1.3284, 0.8338, 0.6044]])
n = x.size
sample_mean = np.mean(x)
sample_var = np.var(x)
sig2 = np.var(x)
mu = sample_mean
N = 10**5
gibbs_sample = np.array(np.zeros((N, 2)))
for k in range(N):
    mu = sample_mean + np.sqrt(sig2/n)*np.random.randn()  # draw mu given sig2
    V = np.sum((x-mu)**2)/2
    sig2 = 1/np.random.gamma(n/2, 1/V)  # draw sig2 given mu; second argument is the scale 1/V
    gibbs_sample[k,:] = np.array([mu, sig2])

plt.scatter(gibbs_sample[:,0], gibbs_sample[:,1], alpha=0.1, s=1)
plt.plot(np.mean(x), np.var(x), 'wo')
plt.show()














Figure 3.6: Left: approximate draws from the posterior pdf $f(\mu, \sigma^2 \mid \mathbf{x})$ obtained via the Gibbs sampler. Right: estimate of the posterior pdf $f(\mu \mid \mathbf{x})$.

The left panel of Figure 3.6 shows the $(\mu, \sigma^2)$ points generated by the Gibbs sampler. Also shown, via the white circle, is the point $(\bar{x}, s^2)$, where $\bar{x} = 0.3798$ is the sample mean and $s^2 = 0.6810$ the sample variance. This posterior point cloud visualizes the considerable uncertainty in the estimates. By projecting the $(\mu, \sigma^2)$ points onto the $\mu$-axis (that is, by ignoring the $\sigma^2$ values) one obtains (approximate) samples from the posterior pdf of $\mu$; that is, $f(\mu \mid \mathbf{x})$. The right panel of Figure 3.6 shows a kernel density estimate (see Section 4.4) of this pdf. The corresponding 0.025 and 0.975 sample quantiles were found to be −0.2054 and 0.9662, respectively, giving the 95% credible interval (−0.2054, 0.9662) for $\mu$, which contains the true expectation 0. Similarly, an estimated 95% credible interval for $\sigma^2$ is (0.3218, 2.2485), which contains the true variance 1.
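Credible intervals of this kind can be read off from the Gibbs output with sample quantiles. The sketch below reruns the sampler with a shorter chain and a seeded generator (both assumptions made for illustration) and takes the 0.025 and 0.975 quantiles, which bracket 95% of the simulated values; the numbers will differ slightly from those quoted above because of simulation noise.

```python
import numpy as np

x = np.array([-0.9472, 0.5401, -0.2166, 1.1890, 1.3170,
              -0.4056, -0.4449, 1.3284, 0.8338, 0.6044])
n, xbar = x.size, x.mean()

rng = np.random.default_rng(10)          # arbitrary seed (assumption)
N = 20_000                               # shorter chain than in the script above (assumption)
mu, sig2 = xbar, x.var()
draws = np.empty((N, 2))
for k in range(N):
    mu = xbar + np.sqrt(sig2 / n) * rng.standard_normal()   # (mu | sig2, x)
    V = np.sum((x - mu)**2) / 2
    sig2 = 1 / rng.gamma(n / 2, 1 / V)                      # (sig2 | mu, x); scale = 1/V
    draws[k] = mu, sig2

mu_ci = np.quantile(draws[:, 0], [0.025, 0.975])     # 95% credible interval for mu
sig2_ci = np.quantile(draws[:, 1], [0.025, 0.975])   # 95% credible interval for sigma^2
print(mu_ci, sig2_ci)
```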


3.3 Monte Carlo Estimation 


In this section we describe how Monte Carlo simulation can be used to estimate complicated integrals, probabilities, and expectations. A number of variance reduction techniques are introduced as well, including the recent cross-entropy method.

3.3.1 Crude Monte Carlo

The most common setting for Monte Carlo estimation is the following: Suppose we wish to compute the expectation $\mu = \mathbb{E}Y$ of some (say continuous) random variable $Y$ with pdf $f$, but the integral $\mathbb{E}Y = \int y f(y)\, \mathrm{d}y$ is difficult to evaluate. For example, if $Y$ is a complicated function of other random variables, it would be difficult to obtain an exact expression for $f(y)$. The idea of crude Monte Carlo (sometimes abbreviated as CMC) is to approximate $\mu$ by simulating many independent copies $Y_1, \ldots, Y_N$ of $Y$ and then take their sample mean $\bar{Y}$ as an estimator of $\mu$. All that is needed is an algorithm to simulate such copies.

By the Law of Large Numbers, $\bar{Y}$ converges to $\mu$ as $N \to \infty$, provided the expectation of $Y$ exists. Moreover, by the Central Limit Theorem, $\bar{Y}$ approximately has a $\mathcal{N}(\mu, \sigma^2/N)$ distribution for large $N$, provided that the variance $\sigma^2 = \mathrm{Var}\, Y < \infty$. This enables the construction of an approximate $(1 - \alpha)$ confidence interval for $\mu$:

$$\left( \bar{Y} - z_{1-\alpha/2} \frac{S}{\sqrt{N}},\ \ \bar{Y} + z_{1-\alpha/2} \frac{S}{\sqrt{N}} \right), \qquad (3.13)$$

where $S$ is the sample standard deviation of the $\{Y_i\}$ and $z_\gamma$ denotes the $\gamma$-quantile of the $\mathcal{N}(0, 1)$ distribution; see also Section C.13. Instead of specifying the confidence interval, one often reports only the sample mean and the estimated standard error: $S/\sqrt{N}$, or the estimated relative error: $S/(\bar{Y}\sqrt{N})$. The basic estimation procedure for independent data is summarized in Algorithm 3.3.1 below.


It is often the case that the output $Y$ is a function of some underlying random vector or stochastic process; that is, $Y = H(\mathbf{X})$, where $H$ is a real-valued function and $\mathbf{X}$ is a random vector or process. The beauty of Monte Carlo for estimation is that (3.13) holds regardless of the dimension of $\mathbf{X}$.

Algorithm 3.3.1: Crude Monte Carlo for Independent Data
input: Simulation algorithm for $Y \sim f$, sample size $N$, confidence level $1 - \alpha$.
output: Point estimate and approximate $(1 - \alpha)$ confidence interval for $\mu = \mathbb{E}Y$.
1 Simulate $Y_1, \ldots, Y_N \overset{\text{iid}}{\sim} f$.
2 $\bar{Y} \leftarrow \frac{1}{N} \sum_{i=1}^{N} Y_i$
3 $S^2 \leftarrow \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2$
4 return $\bar{Y}$ and the interval (3.13).
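Algorithm 3.3.1 could be written as a small reusable function along the following lines. The function name crude_monte_carlo, the test problem $Y = e^U$ with $U \sim \mathsf{U}(0,1)$ (for which $\mathbb{E}Y = e - 1$), and the seed are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def crude_monte_carlo(simulate, N, alpha, rng):
    """Return the sample mean and the approximate (1 - alpha) confidence interval (3.13)."""
    y = simulate(N, rng)                        # Step 1: Y_1, ..., Y_N iid
    ybar = y.mean()                             # Step 2: sample mean
    s = y.std(ddof=1)                           # Step 3: sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_{1 - alpha/2}, quantile of N(0, 1)
    half = z * s / np.sqrt(N)
    return ybar, (ybar - half, ybar + half)

rng = np.random.default_rng(7)                  # arbitrary seed (assumption)
est, ci = crude_monte_carlo(lambda N, rng: np.exp(rng.random(N)), 100_000, 0.05, rng)
print(est, ci, np.e - 1)                        # estimate, interval, and exact value e - 1
```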


Example 3.10 (Monte Carlo Integration) In Monte Carlo integration, simulation is used to evaluate complicated integrals. Consider, for example, the integral

$$\mu = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \sqrt{|x_1 + x_2 + x_3|}\; e^{-(x_1^2 + x_2^2 + x_3^2)/2}\, \mathrm{d}x_1\, \mathrm{d}x_2\, \mathrm{d}x_3.$$

Defining $Y = |X_1 + X_2 + X_3|^{1/2} (2\pi)^{3/2}$, with $X_1, X_2, X_3 \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$, we can write $\mu = \mathbb{E}Y$.

Using the following Python program, with a sample size of $N = 10^6$, we obtained an estimate $\bar{Y} = 17.031$ with an approximate 95% confidence interval (17.017, 17.046).

mcint.py
import numpy as np
from numpy import pi

c = (2*pi)**(3/2)
H = lambda x: c*np.sqrt(np.abs(np.sum(x,axis=1)))
N = 10**6
z = 1.96
x = np.random.randn(N,3)
y = H(x)
mY = np.mean(y)
sY = np.std(y)
RE = sY/mY/np.sqrt(N)
print('Estimate = {:3.3f}, CI = ({:3.3f},{:3.3f})'.format(
    mY, mY*(1-z*RE), mY*(1+z*RE)))

Estimate = 17.031, CI = (17.017,17.046)





Example 3.11 (Example 2.1 (cont.)) We return to the bias–variance tradeoff in Example 2.1. Figure 2.7 gives estimates of the (squared-error) generalization risk (2.5) as a function of the number of parameters in the model. But how accurate are these estimates? Because we know in this case the exact model for the data, we can use Monte Carlo simulation to estimate the generalization risk (for a fixed training set) and the expected generalization risk (averaged over all training sets) precisely. All we need to do is repeat the data generation, fitting, and validation steps many times and then take averages of the results. The following Python code repeats 100 times:

1. Simulate the training set of size $n = 100$.

2. Fit models up to size $k = 8$.

3. Estimate the test loss using a test set with the same sample size $n = 100$.

Figure 3.7 shows that there is some variation in the test losses, due to the randomness in both the training and test sets. To obtain an accurate estimate of the expected generalization risk (2.6), take the average of the test losses. We see that for $k \leq 8$ the estimate in Figure 2.7 is close to the true expected generalization risk.

Figure 3.7: Independent estimates of the test loss show some variability.


CMCtestloss.py
import numpy as np, matplotlib.pyplot as plt
from numpy.random import rand, randn
from numpy.linalg import solve

def generate_data(beta, sig, n):
    u = rand(n, 1)
    y = (u ** np.arange(0, 4)) @ beta + sig * randn(n, 1)
    return u, y

beta = np.array([[10, -140, 400, -250]]).T
n = 100
sig = 5
betahat = {}
plt.figure(figsize=[6,5])
totMSE = np.zeros(8)
max_p = 8
p_range = np.arange(1, max_p + 1, 1)

for N in range(0,100):
    u, y = generate_data(beta, sig, n)  # training data
    X = np.ones((n, 1))
    for p in p_range:
        if p > 1:
            X = np.hstack((X, u**(p-1)))
        betahat[p] = solve(X.T @ X, X.T @ y)
    u_test, y_test = generate_data(beta, sig, n)  # test data
    MSE = []
    X_test = np.ones((n, 1))
    for p in p_range:
        if p > 1:
            X_test = np.hstack((X_test, u_test**(p-1)))
        y_hat = X_test @ betahat[p]  # predictions
        MSE.append(np.sum((y_test - y_hat)**2/n))
    totMSE = totMSE + np.array(MSE)
    plt.plot(p_range, MSE,'C0',alpha=0.1)
# N equals 99 after the loop, so divide by N+1 = 100 repetitions
plt.plot(p_range,totMSE/(N+1),'r-o')
plt.xticks(ticks=p_range)
plt.xlabel('Number of parameters $p$')
plt.ylabel('Test loss')
plt.tight_layout()
plt.savefig('MSErepeatpy.pdf',format='pdf')
plt.show()

3.3.2 Bootstrap Method

The bootstrap method [37] combines CMC estimation with the resampling procedure of Section 3.2.4. The idea is as follows: Suppose we wish to estimate a number $\mu$ via some estimator $Y = H(\mathcal{T})$, where $\mathcal{T} := \{X_1, \ldots, X_n\}$ is an iid sample from some unknown cdf $F$. It is assumed that $Y$ does not depend on the order of the $\{X_i\}$. To assess the quality (for example, accuracy) of the estimator $Y$, one could draw independent replications $\mathcal{T}_1, \ldots, \mathcal{T}_N$ of $\mathcal{T}$ and find sample estimates for quantities such as the variance $\mathrm{Var}\, Y$, the bias $\mathbb{E}Y - \mu$, and the mean squared error $\mathbb{E}(Y - \mu)^2$. However, it may be too time-consuming or simply not feasible to obtain such replications. An alternative is to resample the original data. To reiterate, given an outcome $\tau = \{x_1, \ldots, x_n\}$ of $\mathcal{T}$, we simulate an iid sample $\mathcal{T}^* := \{X_1^*, \ldots, X_n^*\}$ from the empirical cdf $F_n$, via Algorithm 3.2.6 (hence the resampling size is $N = n$ here).

The rationale is that the empirical cdf $F_n$ is close to the actual cdf $F$ and gets closer as $n$ gets larger. Hence, any quantities depending on $F$, such as $\mathbb{E}_F\, g(Y)$, where $g$ is a function, can be approximated by $\mathbb{E}_{F_n} g(Y)$. The latter is usually still difficult to evaluate, but it can be simply estimated via CMC as

$$\frac{1}{K} \sum_{i=1}^{K} g(Y_i^*),$$

where $Y_1^*, \ldots, Y_K^*$ are independent random variables, each distributed as $Y^* = H(\mathcal{T}^*)$. This seemingly self-referent procedure is called bootstrapping, alluding to Baron von Münchausen, who pulled himself out of a swamp by his own bootstraps. As an example, the bootstrap estimate of the expectation of $Y$ is

$$\widehat{\mathbb{E}Y} = \bar{Y}^* = \frac{1}{K} \sum_{i=1}^{K} Y_i^*,$$

which is simply the sample mean of $\{Y_i^*\}$. Similarly, the bootstrap estimate for $\mathrm{Var}\, Y$ is the sample variance

$$\widehat{\mathrm{Var}\, Y} = \frac{1}{K-1} \sum_{i=1}^{K} (Y_i^* - \bar{Y}^*)^2. \qquad (3.14)$$

Bootstrap estimators for the bias and MSE are $\bar{Y}^* - Y$ and $\frac{1}{K} \sum_{i=1}^{K} (Y_i^* - Y)^2$, respectively. Note that for these estimators the unknown quantity $\mu$ is replaced with its original estimator $Y$. Confidence intervals can be constructed in the same fashion. We mention two variants: the normal method and the percentile method. In the normal method, a $1 - \alpha$ confidence interval for $\mu$ is given by

$$(Y \pm z_{1-\alpha/2}\, S^*),$$

where $S^*$ is the bootstrap estimate of the standard deviation of $Y$; that is, the square root of (3.14). In the percentile method, the upper and lower bounds of the $1 - \alpha$ confidence interval for $\mu$ are given by the $1 - \alpha/2$ and $\alpha/2$ quantiles of $Y$, which in turn are estimated via the corresponding sample quantiles of the bootstrap sample $\{Y_i^*\}$.
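The bootstrap quantities above can be sketched in a few lines. Everything specific here is an assumption for illustration: the data are simulated from an exponential distribution, the estimator $H$ is the sample median, and K = 2000 bootstrap replications are used.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(8)            # arbitrary seed (assumption)
x = rng.exponential(size=50)              # hypothetical original data
H = np.median                             # hypothetical estimator Y = H(T)
Y = H(x)

K = 2000                                  # number of bootstrap replications
Ystar = np.array([H(rng.choice(x, size=x.size, replace=True)) for _ in range(K)])

bias = Ystar.mean() - Y                   # bootstrap estimate of the bias
Sstar = Ystar.std(ddof=1)                 # square root of the sample variance (3.14)
mse = np.mean((Ystar - Y)**2)             # bootstrap estimate of the MSE

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)
ci_normal = (Y - z * Sstar, Y + z * Sstar)                              # normal method
ci_percentile = tuple(np.quantile(Ystar, [alpha / 2, 1 - alpha / 2]))   # percentile method
print(bias, Sstar, ci_normal, ci_percentile)
```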

The following example illustrates the usefulness of the bootstrap method for ratio es- 
timation and also introduces the renewal reward process model for data. 


Example 3.12 (Bootstrapping the Ratio Estimator) A common scenario in stochastic
simulation is that the output of the simulation consists of independent pairs of data
$(C_1, R_1), (C_2, R_2), \ldots$, where each $C$ is interpreted as the length of a period of time (a
so-called cycle) and $R$ is the reward obtained during that cycle. Such a collection of random
variables $\{(C_i, R_i)\}$ is called a renewal reward process. Typically, the reward $R_i$ depends on
the cycle length $C_i$. Let $A_t$ be the average reward earned by time $t$; that is, $A_t = \sum_{i=1}^{N_t} R_i / t$,
where $N_t = \max\{n : C_1 + \cdots + C_n \le t\}$ counts the number of complete cycles at time $t$. It
can be shown, see Exercise 20, that if the expectations of the cycle length and reward are
finite, then $A_t$ converges to the constant $\mathbb{E}R/\mathbb{E}C$ as $t \to \infty$. This ratio can thus be interpreted
as the long-run average reward.

Estimation of the ratio $\mathbb{E}R/\mathbb{E}C$ from data $(C_1, R_1), \ldots, (C_n, R_n)$ is easy: take the ratio
estimator

$$\widehat{A} = \frac{\overline{R}}{\overline{C}}.$$

However, this estimator $\widehat{A}$ is not unbiased and it is not obvious how to derive confidence
intervals. Fortunately, the bootstrap method can come to the rescue: simply resample the
pairs $\{(C_i, R_i)\}$, obtain ratio estimators $A_1^*, \ldots, A_K^*$, and from these compute quantities of
interest such as confidence intervals.

As a concrete example, let us return to the Markov chain in Example 3.6. Recall that
the chain starts at state 1 at time 0. After a certain amount of time $T_1$, the process returns
to state 1. The time steps $0, \ldots, T_1 - 1$ form a natural “cycle” for this process, as from
time $T_1$ onwards the process behaves probabilistically exactly the same as when it started,
independently of $X_0, \ldots, X_{T_1 - 1}$. Thus, if we define $T_0 = 0$, and let $T_i$ be the $i$-th time that
the chain returns to state 1, then we can break up the time interval into independent cycles
of lengths $C_i = T_i - T_{i-1}$, $i = 1, 2, \ldots$. Now suppose that during the $i$-th cycle a reward

$$R_i = \sum_{t = T_{i-1}}^{T_i - 1} \varrho^{\,t - T_{i-1}}\, r(X_t)$$

is received, where $r(i)$ is some fixed reward for visiting state $i \in \{1, 2, 3, 4\}$ and $\varrho \in (0, 1)$
is a discounting factor. Clearly, $\{(C_i, R_i)\}$ is a renewal reward process. Figure 3.8 shows the
outcomes of 1000 pairs $(C, R)$, using $r(1) = 4$, $r(2) = 3$, $r(3) = 10$, $r(4) = 1$, and $\varrho = 0.9$.


Figure 3.8: Each circle represents a (cycle length, reward) pair, with the cycle length on the
horizontal axis. The varying circle sizes indicate the number of occurrences for a given pair.
For example, (3, 15.43) is the most likely pair here, occurring 186 times out of 1000. It
corresponds to the cycle path $1 \to 3 \to 2 \to 1$.


The long-run average reward is estimated as 2.50 for our data. But how accurate is this 
estimate? Figure 3.9 shows a density plot of the bootstrapped ratio estimates, where we 
independently resampled the data pairs 1000 times. 


Figure 3.9: Density plot of the bootstrapped ratio estimates for the Markov chain renewal
reward process (horizontal axis: long-run average reward; vertical axis: density).

Figure 3.9 indicates that the true long-run average reward lies between 2.2 and 2.8
with high confidence. More precisely, the 99% bootstrap confidence interval (percentile
method) is here (2.27, 2.77). The following Python script spells out the procedure.


ratioest.py

import numpy as np, matplotlib.pyplot as plt, seaborn as sns
from numba import jit

np.random.seed(123)
n = 1000
P = np.array([[0, 0.2, 0.5, 0.3],
              [0.5, 0, 0.5, 0],
              [0.3, 0.7, 0, 0],
              [0.1, 0, 0, 0.9]])     # transition matrix of the Markov chain
r = np.array([4, 3, 10, 1])           # rewards r(1), ..., r(4)
Corg = np.array(np.zeros((n,1)))      # cycle lengths
Rorg = np.array(np.zeros((n,1)))      # cycle rewards
rho = 0.9                             # discounting factor

@jit(forceobj=True)  # for speed-up; see Appendix (object mode: the function writes to global arrays)
def generate_cyclereward(n):
    for i in range(n):
        t = 1
        xreg = 1  # regenerative state (out of 1,2,3,4)
        reward = r[0]
        x = np.amin(np.argwhere(np.cumsum(P[xreg-1,:]) > np.random.rand())) + 1
        while x != xreg:
            t += 1
            reward += rho**(t-1)*r[x-1]
            x = np.amin(np.where(np.cumsum(P[x-1,:]) > np.random.rand())) + 1
        Corg[i] = t
        Rorg[i] = reward
    return Corg, Rorg

Corg, Rorg = generate_cyclereward(n)

Aorg = np.mean(Rorg)/np.mean(Corg)    # ratio estimate from the original data
K = 5000                              # number of bootstrap replications
A = np.array(np.zeros((K,1)))
C = np.array(np.zeros((n,1)))
R = np.array(np.zeros((n,1)))
for i in range(K):
    ind = np.ceil(n*np.random.rand(1,n)).astype(int)[0]-1   # resampled indices
    C = Corg[ind]
    R = Rorg[ind]
    A[i] = np.mean(R)/np.mean(C)

plt.xlabel('long-run average reward')
plt.ylabel('density')
sns.kdeplot(A.flatten(), fill=True)   # fill replaces the deprecated shade argument
plt.show()


3.3.3 Variance Reduction


The estimation of performance measures in Monte Carlo simulation can be made more 
efficient by utilizing known information about the simulation model. Variance reduction 
techniques include antithetic variables, control variables, importance sampling, conditional 
Monte Carlo, and stratified sampling; see, for example, [71, Chapter 9]. We shall only deal 
with control variables and importance sampling here. 


Suppose $Y$ is the output of a simulation experiment. A random variable $\widetilde{Y}$, obtained
from the same simulation run, is called a control variable for $Y$ if $Y$ and $\widetilde{Y}$ are correlated
(negatively or positively) and the expectation of $\widetilde{Y}$ is known. The use of control variables
for variance reduction is based on the following theorem. We leave its proof to Exercise 21.


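Theorem 3.3: Control Variable Estimation

Let $Y_1, \ldots, Y_N$ be the output of $N$ independent simulation runs and let $\widetilde{Y}_1, \ldots, \widetilde{Y}_N$ be the
corresponding control variables, with $\mathbb{E}\widetilde{Y}_k = \widetilde{\mu}$ known. Let $\varrho_{Y,\widetilde{Y}}$ be the correlation
coefficient between each $Y_k$ and $\widetilde{Y}_k$. For each $\alpha \in \mathbb{R}$ the estimator

$$\widehat{\mu}^{(c)} = \frac{1}{N} \sum_{k=1}^{N} \left[ Y_k - \alpha\,(\widetilde{Y}_k - \widetilde{\mu}) \right] \tag{3.15}$$

is an unbiased estimator for $\mu = \mathbb{E}Y$. The minimal variance of $\widehat{\mu}^{(c)}$ is

$$\mathrm{Var}\,\widehat{\mu}^{(c)} = \frac{1}{N}\,(1 - \varrho^2_{Y,\widetilde{Y}})\,\mathrm{Var}\,Y, \tag{3.16}$$

which is obtained for $\alpha = \varrho_{Y,\widetilde{Y}} \sqrt{\mathrm{Var}\,Y / \mathrm{Var}\,\widetilde{Y}}$.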





From (3.16) we see that, by using the optimal $\alpha$ in (3.15), the variance of the control
variate estimator is a factor $1 - \varrho^2_{Y,\widetilde{Y}}$ smaller than the variance of the crude Monte Carlo
estimator. Thus, if $\widetilde{Y}$ is highly correlated with $Y$, a significant variance reduction can be
achieved. The optimal $\alpha$ is usually unknown, but it can be easily estimated from the sample
covariance matrix of $\{(Y_k, \widetilde{Y}_k)\}$.
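
As a small self-contained illustration of estimating $\alpha$ from the sample covariance matrix, the sketch below uses a toy problem that is not from the text: the output is $Y = \mathrm{e}^U$ with $U \sim \mathcal{U}(0,1)$, so that $\mathbb{E}Y = \mathrm{e} - 1$, and the control variable is $U$ itself with known mean $1/2$. All variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
U = rng.random(N)
Y = np.exp(U)            # simulation output; E[Y] = e - 1
Yc = U                   # control variable with known expectation 1/2
mu_c = 0.5

C = np.cov(Y, Yc)                              # 2 x 2 sample covariance matrix
alpha = C[0, 1] / C[1, 1]                      # estimated optimal alpha
rho = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])     # sample correlation coefficient

est_cmc = Y.mean()                             # crude Monte Carlo estimate
est_cv = np.mean(Y - alpha * (Yc - mu_c))      # control variable estimate
# ratio of the sample variances of the controlled and the crude outputs
var_ratio = np.var(Y - alpha * (Yc - mu_c), ddof=1) / C[0, 0]
print(est_cmc, est_cv, rho, var_ratio, 1 - rho**2)
```

With $\alpha$ estimated from the same sample, the in-sample variance ratio coincides with $1 - \varrho^2$ computed from the sample correlation, which mirrors the factor in (3.16).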

In the next example, we estimate the multiple integral in Example 3.10 using control 
variables. 


Example 3.13 (Monte Carlo Integration (cont.)) The random variable $Y = |X_1 + X_2 + X_3|^{1/2} (2\pi)^{3/2}$
is positively correlated with the random variable $\widetilde{Y} = X_1^2 + X_2^2 + X_3^2$, for the
same choice of $X_1, X_2, X_3 \sim_{\mathrm{iid}} \mathcal{N}(0, 1)$. As $\mathbb{E}\widetilde{Y} = \mathrm{Var}(X_1 + X_2 + X_3) = 3$, we can use it as a
control variable to estimate the expectation of $Y$. The following Python program is based
on Theorem 3.3. It imports the crude Monte Carlo sampling code from Example 3.10.

mcintCV.py


from mcint import *

Yc = np.sum(x**2, axis=1)   # control variable data
yc = 3                      # true expectation of control variable
C = np.cov(y, Yc)           # sample covariance matrix
cor = C[0][1]/np.sqrt(C[0][0]*C[1][1])   # sample correlation coefficient
alpha = C[0][1]/C[1][1]                  # estimated optimal alpha

est = np.mean(y - alpha*(Yc - yc))
RECV = np.sqrt((1 - cor**2)*C[0][0]/N)/est   # relative error

print('Estimate = {:3.3f}, CI = ({:3.3f},{:3.3f}), Corr = {:3.3f}'.
      format(est, est*(1-z*RECV), est*(1+z*RECV), cor))





Estimate = 17.045, CI = (17.032,17.057), Corr = 0.480 


A typical estimate of the correlation coefficient $\varrho_{Y,\widetilde{Y}}$ is 0.48, which reduces the
variance by a factor $1 - 0.48^2 \approx 0.77$; that is, a simulation speed-up of 23% compared with
crude Monte Carlo. Although the gain is small in this case, due to the modest correlation
between $Y$ and $\widetilde{Y}$, little extra work was required to achieve this variance reduction.


One of the most important variance reduction techniques is importance sampling. This
technique is especially useful for the estimation of very small probabilities. The standard
setting is the estimation of a quantity

$$\mu = \mathbb{E}_f H(X) = \int H(x)\, f(x)\, \mathrm{d}x, \tag{3.17}$$

where $H$ is a real-valued function and $f$ the probability density of a random vector $X$,
called the nominal pdf. The subscript $f$ is added to the expectation operator to indicate that
it is taken with respect to the density $f$.

Let $g$ be another probability density such that $g(x) = 0$ implies that $H(x) f(x) = 0$.
Using the density $g$ we can represent $\mu$ as

$$\mu = \int H(x)\, \frac{f(x)}{g(x)}\, g(x)\, \mathrm{d}x = \mathbb{E}_g \left[ H(X)\, \frac{f(X)}{g(X)} \right]. \tag{3.18}$$

Consequently, if $X_1, \ldots, X_N \sim_{\mathrm{iid}} g$, then

$$\widehat{\mu} = \frac{1}{N} \sum_{k=1}^{N} H(X_k)\, \frac{f(X_k)}{g(X_k)} \tag{3.19}$$

is an unbiased estimator of $\mu$. This estimator is called the importance sampling estimator
and $g$ is called the importance sampling density. The ratio of densities, $f(x)/g(x)$, is called
the likelihood ratio. The importance sampling pseudo-code is given in Algorithm 3.3.2.

Algorithm 3.3.2: Importance Sampling Estimation
input: Function $H$, importance sampling density $g$ such that $g(x) = 0$ implies
       $H(x) f(x) = 0$, sample size $N$, confidence level $1 - \alpha$.
output: Point estimate and approximate $(1 - \alpha)$ confidence interval for
        $\mu = \mathbb{E} H(X)$, where $X \sim f$.
1 Simulate $X_1, \ldots, X_N \sim_{\mathrm{iid}} g$ and let $Y_i = H(X_i) f(X_i)/g(X_i)$, $i = 1, \ldots, N$.
2 Estimate $\mu$ via $\widehat{\mu} = \overline{Y}$ and determine an approximate $(1 - \alpha)$ confidence interval as
      $$\mathcal{I} := \left( \widehat{\mu} - z_{1-\alpha/2} \frac{S}{\sqrt{N}},\ \widehat{\mu} + z_{1-\alpha/2} \frac{S}{\sqrt{N}} \right),$$
  where $z_\gamma$ denotes the $\gamma$-quantile of the $\mathcal{N}(0, 1)$ distribution and $S$ is the sample
  standard deviation of $Y_1, \ldots, Y_N$.
3 return $\widehat{\mu}$ and the interval $\mathcal{I}$.
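
The steps of Algorithm 3.3.2 translate almost directly into code. The following is a minimal sketch rather than a definitive implementation: the function name importance_sampling, its argument names, and the illustrative problem (estimating the small probability $\mathbb{P}(X > 4)$ for $X \sim \mathcal{N}(0,1)$ by sampling from $\mathcal{N}(4,1)$) are assumptions chosen for the example.

```python
import numpy as np
from statistics import NormalDist

def importance_sampling(H, f, g, sample_g, N, alpha=0.05, seed=0):
    """Point estimate and approximate (1 - alpha) confidence interval for E_f H(X)."""
    rng = np.random.default_rng(seed)
    X = sample_g(rng, N)                  # Step 1: X_1, ..., X_N iid from g
    Y = H(X) * f(X) / g(X)                # weighted outputs Y_i
    mu_hat = Y.mean()                     # Step 2: point estimate
    half = NormalDist().inv_cdf(1 - alpha / 2) * Y.std(ddof=1) / np.sqrt(N)
    return mu_hat, (mu_hat - half, mu_hat + half)

# Illustration (hypothetical choice): P(X > 4) for X ~ N(0,1), sampling from N(4,1)
norm_pdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
mu_hat, ci = importance_sampling(
    H=lambda x: (x > 4).astype(float),
    f=lambda x: norm_pdf(x, 0.0),
    g=lambda x: norm_pdf(x, 4.0),
    sample_g=lambda rng, N: 4.0 + rng.standard_normal(N),
    N=100_000,
)
exact = 1 - NormalDist().cdf(4.0)         # reference value for comparison
print(mu_hat, ci, exact)
```

Under crude Monte Carlo with the same sample size, only a handful of the draws would exceed 4, so the shifted sampling density is what makes the estimate usable here.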
Example 3.14 (Importance Sampling) Let us examine the workings of importance
sampling by estimating the area, $\mu$ say, under the graph of the function

$$M(x_1, x_2) = \mathrm{e}^{-\frac{1}{4}\sqrt{x_1^2 + x_2^2}} \left( \sin\left(2\sqrt{x_1^2 + x_2^2}\right) + 1 \right), \quad (x_1, x_2) \in \mathbb{R}^2. \tag{3.20}$$

We saw a similar function in Example 3.8 (but note the different domain). A natural
approach to estimate the area is to truncate the domain to the square $[-b, b]^2$, for large enough
$b$, and to estimate the integral

$$\mu_b = \int_{-b}^{b} \int_{-b}^{b} \underbrace{(2b)^2 M(x)}_{H(x)}\, f(x)\, \mathrm{d}x = \mathbb{E}_f H(X)$$

via crude Monte Carlo, where $f(x) = 1/(2b)^2$, $x \in [-b, b]^2$, is the pdf of the uniform
distribution on $[-b, b]^2$. Here is the Python code which does just that.


impsamp1.py

import numpy as np
from numpy import exp, sqrt, sin, pi, log, cos
from numpy.random import rand

b = 1000   # truncation level
H = lambda x1, x2: (2*b)**2 * exp(-sqrt(x1**2+x2**2)/4)*(sin(2*sqrt(
    x1**2+x2**2))+1)*(x1**2 + x2**2 < b**2)
f = 1/((2*b)**2)   # uniform pdf on the square [-b, b]^2
N = 10**6
X1 = -b + 2*b*rand(N,1)
X2 = -b + 2*b*rand(N,1)
Z = H(X1,X2)
estCMC = np.mean(Z).item()  # to obtain scalar
RECMC = np.std(Z)/estCMC/sqrt(N).item()   # estimated relative error
print('CI = ({:3.3f},{:3.3f}), RE = {:3.3f}'.format(estCMC*(1-1.96*
      RECMC), estCMC*(1+1.96*RECMC), RECMC))


CI = (82.663,135.036), RE = 0.123

For a truncation level of $b = 1000$ and a sample size of $N = 10^6$, a typical estimate is
108.8, with an estimated relative error of 0.123. We have two sources of error here. The
first is the error in approximating $\mu$ by $\mu_b$. However, as the function $H$ decays exponentially
fast, $b = 1000$ is more than enough to ensure this error is negligible. The second type of
error is the statistical error, due to the estimation process itself. This can be quantified by
the estimated relative error, and can be reduced by increasing the sample size.

Let us now consider an importance sampling approach in which the importance
sampling pdf $g$ is radially symmetric and decays exponentially in the radius, similar to the
function $H$. In particular, we simulate $(X_1, X_2)$ in a way akin to Example 3.1, by first
generating a radius $R \sim \mathrm{Exp}(\lambda)$ and an angle $\Theta \sim \mathcal{U}(0, 2\pi)$, and then returning $X_1 = R\cos(\Theta)$
and $X_2 = R\sin(\Theta)$. By the Transformation Rule (Theorem C.4) we then have

$$g(x) = f_{R,\Theta}(r, \theta)\, \frac{1}{r} = \lambda \mathrm{e}^{-\lambda r}\, \frac{1}{2\pi}\, \frac{1}{r} = \frac{\lambda\, \mathrm{e}^{-\lambda \sqrt{x_1^2 + x_2^2}}}{2\pi \sqrt{x_1^2 + x_2^2}}, \quad x \in \mathbb{R}^2 \setminus \{0\}.$$

The following code, which imports the one given above, implements the importance
sampling steps, using the parameter $\lambda = 0.1$.


impsamp2.py

from impsamp1 import *

lam = 0.1
g = lambda x1, x2: lam*exp(-sqrt(x1**2 + x2**2)*lam)/sqrt(x1**2 + x2
    **2)/(2*pi)   # importance sampling pdf
U = rand(N,1); V = rand(N,1)
R = -log(U)/lam            # radius R ~ Exp(lam)
X1 = R*cos(2*pi*V)
X2 = R*sin(2*pi*V)
Z = H(X1,X2)*f/g(X1,X2)    # outputs weighted by the likelihood ratio
estIS = np.mean(Z).item()  # obtain scalar
REIS = np.std(Z)/estIS/sqrt(N).item()
print('CI = ({:3.3f},{:3.3f}), RE = {:3.3f}'.format(estIS*(1-1.96*
      REIS), estIS*(1+1.96*REIS), REIS))


CI = (100.723,101.077), RE = 0.001 


A typical estimate is 100.90 with an estimated relative error of $1 \cdot 10^{-3}$, which gives
a substantial variance reduction. In terms of approximate 95% confidence intervals, we
have (82.7, 135.0) in the CMC case versus (100.7, 101.1) in the importance sampling case.
Of course, we could have reduced the truncation level $b$ to improve the performance of
CMC, but then the approximation error might become more significant. For the importance
sampling case, the relative error is hardly affected by the threshold level, but does depend
on the choice of $\lambda$. We chose $\lambda$ such that the decay rate is slower than the decay rate of the
function $H$, which is 0.25.


As illustrated in the above example, a main difficulty in importance sampling is how to
choose the importance sampling distribution. A poor choice of $g$ may seriously affect the
accuracy of both the estimate and the confidence interval. The theoretically optimal choice
$g^*$ for the importance sampling density minimizes the variance of $\widehat{\mu}$ and is therefore the
solution to the functional minimization program

$$\min_g \mathrm{Var}_g \left( H(X)\, \frac{f(X)}{g(X)} \right). \tag{3.21}$$

It is not difficult to show, see also Exercise 22, that if either $H(x) \ge 0$ or $H(x) \le 0$ for all
$x$, then the optimal importance sampling pdf is

$$g^*(x) = \frac{H(x)\, f(x)}{\mu}. \tag{3.22}$$

Namely, in this case $\mathrm{Var}_{g^*} \widehat{\mu} = \mathrm{Var}_{g^*} (H(X) f(X)/g^*(X)) = \mathrm{Var}_{g^*}\, \mu = 0$, so that the estimator $\widehat{\mu}$
is constant under $g^*$. An obvious difficulty is that the evaluation of the optimal importance
sampling density $g^*$ is usually not possible, since $g^*(x)$ in (3.22) depends on the unknown
quantity $\mu$. Nevertheless, one can typically choose a good importance sampling density $g$
“close” to the minimum variance density $g^*$.


One of the main considerations for choosing a good importance sampling pdf is that
the estimator (3.19) should have finite variance. This is equivalent to the requirement
that

$$\mathbb{E}_g \left[ H^2(X)\, \frac{f^2(X)}{g^2(X)} \right] = \mathbb{E}_f \left[ H^2(X)\, \frac{f(X)}{g(X)} \right] < \infty. \tag{3.23}$$

This suggests that $g$ should not have lighter tails than $f$ and that, preferably, the
likelihood ratio, $f/g$, should be bounded.
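
To make the finite-variance requirement concrete, here is a short worked case under assumed choices that are not from the text: take $H \equiv 1$, let the nominal pdf $f$ be the standard normal density, and let the importance sampling pdf $g_\sigma$ be the $\mathcal{N}(0, \sigma^2)$ density.

```latex
\begin{align*}
\frac{f(x)}{g_\sigma(x)}
  &= \sigma \exp\!\left( -\frac{x^2}{2}\Bigl(1 - \frac{1}{\sigma^2}\Bigr) \right), \\
\mathbb{E}_f\!\left[ \frac{f(X)}{g_\sigma(X)} \right]
  &= \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty}
     \exp\!\left( -x^2\Bigl(1 - \frac{1}{2\sigma^2}\Bigr) \right) \mathrm{d}x
   = \frac{\sigma^2}{\sqrt{2\sigma^2 - 1}}, \qquad \sigma^2 > \tfrac{1}{2}.
\end{align*}
```

For $\sigma^2 \le 1/2$ the integral diverges, so the estimator has infinite variance: the tails of $g_\sigma$ are then too light relative to $f$. For $\sigma \ge 1$ the likelihood ratio is bounded by $\sigma$, which is the preferable situation.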


3.4 Monte Carlo for Optimization 


In this section we describe several Monte Carlo methods for optimization. Such random- 
ized algorithms can be useful for solving optimization problems with many local optima 
and complicated constraints, possibly involving a mix of continuous and discrete variables. 
Randomized algorithms are also used to solve noisy optimization problems, in which the 
objective function is unknown and has to be obtained via Monte Carlo simulation. 


3.4.1 Simulated Annealing 


Simulated annealing is a Monte Carlo technique for minimization that emulates the
physical state of atoms in a metal when the metal is heated up and then slowly cooled down.
When the cooling is performed very slowly, the atoms settle down to a minimum-energy
state. Denoting the state as $x$ and the energy of a state as $S(x)$, the probability distribution
of the (random) states is described by the Boltzmann pdf

$$f(x) \propto \mathrm{e}^{-\frac{S(x)}{kT}}, \quad x \in \mathcal{X},$$

where $k$ is Boltzmann’s constant and $T$ is the temperature.

Going beyond the physical interpretation, suppose that $S(x)$ is an arbitrary function to
be minimized, with $x$ taking values in some discrete or continuous set $\mathcal{X}$. The Gibbs pdf
corresponding to $S(x)$ is defined as

$$f_T(x) = \frac{\mathrm{e}^{-\frac{S(x)}{T}}}{z_T}, \quad x \in \mathcal{X},$$

provided that the normalization constant $z_T := \sum_x \exp(-S(x)/T)$ is finite. Note that this
is simply the Boltzmann pdf with the Boltzmann constant $k$ removed. As $T \to 0$, the pdf
becomes more and more peaked around the set of global minimizers of $S$.
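
The concentration of the Gibbs pdf as $T \to 0$ is easy to observe numerically on a finite state space. The sketch below is only an illustration: the helper gibbs_pmf, the grid of states, and the double-well objective are hypothetical choices, not taken from the text.

```python
import numpy as np

def gibbs_pmf(S_values, T):
    """Gibbs probabilities exp(-S/T)/z_T on a finite set of states."""
    # subtracting the minimum leaves the normalized pmf unchanged but avoids overflow
    w = np.exp(-(S_values - S_values.min()) / T)
    return w / w.sum()

# Hypothetical objective on a grid of 401 states in [-2, 2]
x = np.linspace(-2, 2, 401)
S_values = (x**2 - 1) ** 2 + 0.3 * x        # two wells; the left one is deeper
for T in (1.0, 0.4, 0.2, 0.05):
    p = gibbs_pmf(S_values, T)
    print(T, x[p.argmax()], p.max())        # mode and its probability
```

The mode of the pmf sits at the minimizer of the objective for every temperature, while the probability assigned to it grows as the temperature decreases.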
The idea of simulated annealing is to create a sequence of points $X_1, X_2, \ldots$ that are
approximately distributed according to pdfs $f_{T_1}(x), f_{T_2}(x), \ldots$, where $T_1, T_2, \ldots$ is a sequence
of “temperatures” that decreases (is “cooled”) to 0, known as the annealing schedule. If
each $X_t$ were sampled exactly from $f_{T_t}$, then $X_t$ would converge to a global minimum of
$S(x)$ as $T_t \to 0$. However, in practice sampling is approximate and convergence to a global
minimum is not assured. A generic simulated annealing algorithm is as follows.

Algorithm 3.4.1: Simulated Annealing
input: Annealing schedule $T_0, T_1, \ldots$, function $S$, initial value $x_0$.
output: Approximations to the global minimizer $x^*$ and minimum value $S(x^*)$.
1 Set $X_0 \leftarrow x_0$ and $t \leftarrow 1$.
2 while not stopping do
3     Approximately simulate $X_t$ from $f_{T_t}(x)$.
4     $t \leftarrow t + 1$
5 return $X_t, S(X_t)$

A popular annealing schedule is geometric cooling, where $T_t = \beta T_{t-1}$, $t = 1, 2, \ldots$, for
a given initial temperature $T_0$ and a cooling factor $\beta \in (0, 1)$. Appropriate values for $T_0$
and $\beta$ are problem-dependent and this has traditionally required tuning on the part of the
user. A possible stopping criterion is to stop after a fixed number of iterations, or when the
temperature is “small enough”.
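
For geometric cooling, the number of iterations implied by a temperature threshold follows directly from $T_t = \beta^t T_0$. The sketch below generates the schedule and compares its length with the closed form; the function name geometric_schedule and the parameter values are assumptions made for illustration.

```python
import math

def geometric_schedule(T0, beta, T_min):
    """Temperatures T_1, T_2, ... with T_t = beta * T_{t-1}, generated while T_{t-1} > T_min."""
    temps, T = [], T0
    while T > T_min:
        T = beta * T
        temps.append(T)
    return temps

temps = geometric_schedule(T0=1.0, beta=0.999, T_min=1e-3)
# closed form: the smallest t with T0 * beta**t <= T_min
t_closed = math.ceil(math.log(1e-3 / 1.0) / math.log(0.999))
print(len(temps), t_closed, temps[-1])
```

With these values the loop runs for 6905 iterations, which shows how strongly a cooling factor close to 1 drives the total run time.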

Approximate sampling from a Gibbs distribution is most often carried out via Markov
chain Monte Carlo. For each iteration $t$, the Markov chain should theoretically run for a
large number of steps to accurately sample from the Gibbs pdf $f_{T_t}$. However, in practice,
one often only runs a single step of the Markov chain, before updating the temperature, as
in Algorithm 3.4.2 below.

To sample from a Gibbs distribution $f_T$, this algorithm uses a random walk
Metropolis–Hastings sampler. From (3.7), the acceptance probability of a proposal $y$ is thus

$$\alpha(x, y) = \min\left\{ \frac{f_T(y)}{f_T(x)}, 1 \right\} = \min\left\{ \frac{\mathrm{e}^{-\frac{1}{T} S(y)}}{\mathrm{e}^{-\frac{1}{T} S(x)}}, 1 \right\} = \min\left\{ \mathrm{e}^{-\frac{1}{T}(S(y) - S(x))}, 1 \right\}.$$

Hence, if $S(y) < S(x)$, then the proposal is always accepted. Otherwise, the proposal is
accepted with probability $\exp(-\frac{1}{T}(S(y) - S(x)))$.

Algorithm 3.4.2: Simulated Annealing with a Random Walk Sampler
input: Objective function $S$, starting state $X_0$, initial temperature $T_0$, number of
       iterations $N$, symmetric proposal density $q(y \mid x)$, constant $\beta$.
output: Approximate minimizer and minimum value of $S$.
1  for $t = 0$ to $N - 1$ do
2      Simulate a new state $Y$ from the symmetric proposal $q(y \mid X_t)$.
3      if $S(Y) < S(X_t)$ then
4          $X_{t+1} \leftarrow Y$
5      else
6          Draw $U \sim \mathcal{U}(0, 1)$.
7          if $U \le \mathrm{e}^{-(S(Y) - S(X_t))/T_t}$ then
8              $X_{t+1} \leftarrow Y$
9          else
10             $X_{t+1} \leftarrow X_t$
11     $T_{t+1} \leftarrow \beta T_t$
12 return $X_N$ and $S(X_N)$
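
Algorithm 3.4.2 can be sketched as a small reusable function. This is an illustrative version under stated assumptions: the proposal is taken to be $\mathcal{N}(x, \sigma^2)$, the function additionally records the best state visited (which the algorithm itself does not return), and the double-well test function is a hypothetical choice.

```python
import numpy as np

def simulated_annealing(S, x0, T0, beta, N, sigma, seed=0):
    """Random walk simulated annealing with geometric cooling.

    Returns the final state X_N, S(X_N), and the best (state, value) visited.
    """
    rng = np.random.default_rng(seed)
    x, Sx, T = float(x0), S(x0), T0
    best_x, best_S = x, Sx
    for _ in range(N):
        y = x + sigma * rng.standard_normal()        # symmetric N(x, sigma^2) proposal
        Sy = S(y)
        # accept if S decreases, otherwise with probability exp(-(S(y) - S(x)) / T)
        if Sy < Sx or rng.random() <= np.exp(-(Sy - Sx) / T):
            x, Sx = y, Sy
            if Sx < best_S:
                best_x, best_S = x, Sx
        T = beta * T                                  # cooling step
    return x, Sx, best_x, best_S

# Hypothetical test function: a tilted double well whose global minimizer is near x = -1.04
S = lambda x: (x**2 - 1) ** 2 + 0.3 * x
x_fin, S_fin, x_best, S_best = simulated_annealing(
    S, x0=2.0, T0=1.0, beta=0.999, N=7000, sigma=0.5)
print(x_fin, S_fin, x_best, S_best)
```

Keeping track of the best visited state costs nothing extra and guards against the final state having drifted into a local minimum before the chain froze.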


Example 3.15 (Simulated Annealing for Minimization) Let us minimize the “wiggly”
function depicted in the bottom panel of Figure 3.10 and given by:

$$S(x) = \begin{cases} -\mathrm{e}^{-x^2/100} \sin(13x - x^4)^5 \sin(1 - 3x^2)^2, & \text{if } -2 \le x \le 2, \\ \infty, & \text{otherwise.} \end{cases}$$













Figure 3.10: Lower panel: the “wiggly” function $S(x)$. Upper panel: three (unnormalized)
Gibbs pdfs for temperatures $T = 1, 0.4, 0.2$. As the temperature decreases, the Gibbs pdf
converges to the pdf that has all its mass concentrated at the minimizer of $S$.

The function has many local minima and maxima, with a global minimum around 1.4.
The figure also illustrates the relationship between $S$ and the (unnormalized) Gibbs pdf $f_T$.


The following Python code implements a slight variant of Algorithm 3.4.2 where,
instead of stopping after a fixed number of iterations, the algorithm stops when the
temperature is lower than some threshold (here $10^{-3}$).

Instead of stopping after a fixed number $N$ of iterations or when the temperature
is low enough, it is useful to stop when consecutive function values are closer than
some distance $\varepsilon$ to each other, or when the best found function value has not changed
over a fixed number $d$ of iterations.

For a “current” state $x$, the proposal state $Y$ is here drawn from the $\mathcal{N}(x, 0.5^2)$
distribution. We use geometric cooling with decay parameter $\beta = 0.999$ and initial temperature
$T_0 = 1$. We set the initial state to $x_0 = 0$. Figure 3.11 depicts a realization of the sequence
of states $x_t$ for $t = 0, 1, \ldots$. After initially fluctuating wildly, the sequence settles down
to a value around 1.37, with $S(1.37) = -0.92$, corresponding to the global optimizer and
minimum, respectively.


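simann.py

import numpy as np
import matplotlib.pyplot as plt

def wiggly(x):
    # note: the sign of the exponent here (+x**2/100) differs from the displayed
    # formula (-x**2/100); on [-2, 2] the factor stays within about 4% of 1
    y = -np.exp(x**2/100)*np.sin(13*x-x**4)**5*np.sin(1-3*x**2)**2
    y[(x < -2) | (x > 2)] = float('inf')   # S is infinite outside [-2, 2]
    return y

S = wiggly
beta = 0.999        # cooling factor
sig = 0.5           # standard deviation of the random walk proposal
T = 1               # initial temperature
x = np.array([0])   # initial state
xx = []
Sx = S(x)
while T > 10**(-3):
    T = beta*T
    y = x + sig*np.random.randn()
    Sy = S(y)
    alpha = np.minimum(np.exp(-(Sy-Sx)/T), 1)   # acceptance probability
    if np.random.uniform() < alpha:
        x = y
        Sx = Sy
    xx = np.hstack((xx, x))

print('minimizer = {:3.3f}, minimum = {:3.3f}'.format(x[0], Sx[0]))
plt.plot(xx)
plt.show()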


minimizer = 1.365, minimum = -0.958


Figure 3.11: Typical states generated by the simulated annealing algorithm.


3.4.2 Cross-Entropy Method 


The cross-entropy (CE) method [103] is a simple Monte Carlo algorithm that can be used 
for both optimization and estimation. 

The basic idea of the CE method for minimizing a function $S$ on a set $\mathcal{X}$ is to define
a parametric family of probability densities $\{f(\cdot \mid v), v \in \mathcal{V}\}$ on $\mathcal{X}$ and to iteratively update
the parameter $v$ so that $f(\cdot \mid v)$ places more mass on states $x$ that have smaller $S$ values than
on the previous iteration. In particular, the CE algorithm has two basic phases:

• Sampling: Samples $X_1, \ldots, X_N$ are drawn independently according to $f(\cdot \mid v)$. The
objective function $S$ is evaluated at these points.

• Updating: A new parameter $v'$ is selected on the basis of those $X_i$ for which $S(X_i) \le \gamma$
for some level $\gamma$. These $\{X_i\}$ form the elite sample set, $\mathcal{E}$.

At each iteration the level parameter $\gamma$ is chosen as the worst of the $N^{\mathrm{elite}} := \lceil \varrho N \rceil$
best performing samples, where $\varrho \in (0, 1)$ is the rarity parameter; typically, $\varrho = 0.1$ or
$\varrho = 0.01$. The parameter $v$ is updated as a smoothed average $\alpha v' + (1 - \alpha) v$, where $\alpha \in (0, 1)$
is the smoothing parameter and

$$v' := \operatorname*{argmax}_{v \in \mathcal{V}} \sum_{X \in \mathcal{E}} \ln f(X \mid v). \tag{3.24}$$

The updating rule (3.24) is the result of minimizing the Kullback–Leibler divergence
between the conditional density of $X \sim f(x \mid v)$ given $S(X) \le \gamma$, and $f(x; v)$; see [103].
Note that (3.24) yields the maximum likelihood estimator (MLE) of $v$ based on the elite
samples. Hence, for many specific families of distributions, explicit solutions can be found.
An important example is where $X \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$; that is, $X$ has independent Gaussian
components. In this case, the mean vector $\mu$ and the vector of variances $\sigma^2$ are simply
updated via the sample mean and sample variance of the elite samples. This is known as
normal updating. A generic CE procedure for minimization is given in Algorithm 3.4.3.

Algorithm 3.4.3: Cross-Entropy Method for Minimization
input: Function $S$, initial sampling parameter $v_0$, sample size $N$, rarity parameter
       $\varrho$, smoothing parameter $\alpha$.
output: Approximate minimum of $S$ and optimal sampling parameter $v$.
1  Initialize $v_0$, set $N^{\mathrm{elite}} \leftarrow \lceil \varrho N \rceil$ and $t \leftarrow 0$.
2  while a stopping criterion is not met do
3      $t \leftarrow t + 1$
4      Simulate an iid sample $X_1, \ldots, X_N$ from the density $f(\cdot \mid v_{t-1})$.
5      Evaluate the performances $S(X_1), \ldots, S(X_N)$ and sort them from smallest to
       largest: $S_{(1)}, \ldots, S_{(N)}$.
6      Let $\gamma_t$ be the sample $\varrho$-quantile of the performances:
           $$\gamma_t \leftarrow S_{(N^{\mathrm{elite}})}. \tag{3.25}$$
7      Determine the set of elite samples $\mathcal{E}_t = \{X_i : S(X_i) \le \gamma_t\}$.
8      Let $v'_t$ be the MLE of the elite samples:
           $$v'_t \leftarrow \operatorname*{argmax}_{v} \sum_{X \in \mathcal{E}_t} \ln f(X \mid v). \tag{3.26}$$
9      Update the sampling parameter as
           $$v_t \leftarrow \alpha v'_t + (1 - \alpha) v_{t-1}. \tag{3.27}$$
10 return $\gamma_t, v_t$

The CE algorithm produces a sequence of pairs $(\gamma_1, v_1), (\gamma_2, v_2), \ldots$, such that $\gamma_t$
converges (approximately) to the minimal function value, and $f(\cdot \mid v_t)$ to a degenerate pdf that
(approximately) concentrates all its mass at a minimizer of $S$, as $t \to \infty$. A possible
stopping condition is to stop when the sampling distribution $f(\cdot \mid v_t)$ is sufficiently close to a
degenerate distribution. For normal updating this means that the standard deviation is
sufficiently small.

The output of the CE algorithm could also include the overall best function value
and corresponding solution.
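
A compact multivariate version of Algorithm 3.4.3 with normal updating might look as follows. This is a sketch under assumptions, not the text's implementation: the function name ce_minimize and its defaults are hypothetical, smoothing is applied to the standard deviations rather than the variances, and the quadratic test function is chosen only because its minimizer is known.

```python
import math
import numpy as np

def ce_minimize(S, mu0, sigma0, N=100, rho=0.1, alpha=0.8, eps=1e-5, max_iter=500, seed=0):
    """CE minimization with independent normal sampling and smoothed normal updating."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    sigma = np.asarray(sigma0, dtype=float)
    n_elite = math.ceil(rho * N)
    gamma = np.inf
    for _ in range(max_iter):
        X = mu + sigma * rng.standard_normal((N, mu.size))   # sample from f(. | v_{t-1})
        perf = np.array([S(x) for x in X])
        order = np.argsort(perf)
        gamma = perf[order[n_elite - 1]]                     # level gamma_t, as in (3.25)
        elite = X[order[:n_elite]]
        # MLE of the elite samples, followed by the smoothed update (3.27)
        mu = alpha * elite.mean(axis=0) + (1 - alpha) * mu
        sigma = alpha * elite.std(axis=0) + (1 - alpha) * sigma
        if sigma.max() < eps:                                # nearly degenerate: stop
            break
    return gamma, mu, sigma

# Hypothetical test function with minimizer (1, -2) and minimum value 0
S = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 2) ** 2
gamma, mu, sigma = ce_minimize(S, mu0=[0, 0], sigma0=[3, 3])
print(gamma, mu, sigma)
```

Smoothing with $\alpha < 1$ slows the shrinkage of the sampling distribution, which is a common safeguard against converging prematurely to a suboptimal point.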
In the following example, we minimize the same function as in Example 3.15, but
instead use the CE algorithm.

Example 3.16 (Cross-Entropy Method for Minimization) In this case we take the
family of normal distributions $\{\mathcal{N}(\mu, \sigma^2)\}$ for the sampling step (Step 4 of Algorithm 3.4.3),
starting with $\mu = 0$ and $\sigma = 3$. The choice of the initial parameter is quite arbitrary, as long
as $\sigma$ is large enough to sample a wide range of points. We take $N = 100$ samples at each
iteration, set $\varrho = 0.1$, and keep the $N^{\mathrm{elite}} = 10 = \lceil N\varrho \rceil$ smallest ones as the elite samples. The
parameters $\mu$ and $\sigma$ are then updated via the sample mean and sample standard deviation
of the elite samples. In this case we do not use any smoothing ($\alpha = 1$). In the following
Python code the $100 \times 2$ matrix Sx stores the x-values in the first column and the
function values in the second column. The rows of this matrix are sorted in ascending order
according to the function values, giving the matrix sortSx. The first $N^{\mathrm{elite}} = 10$ rows of
this sorted matrix correspond to the elite samples and their function values. The updating
of $\mu$ and $\sigma$ is done in the last two assignments inside the loop. Figure 3.12 shows how the
pdfs of the $\mathcal{N}(\mu_t, \sigma_t^2)$ sampling distributions degenerate to the point mass at the global
minimizer 1.366.


CEmethod.py

from simann import wiggly
import numpy as np
np.set_printoptions(precision=3)
mu, sigma = 0, 3
N, Nel = 100, 10
eps = 10**-5
S = wiggly
while sigma > eps:
    X = np.random.randn(N,1)*sigma + np.array(np.ones((N,1)))*mu   # sample from N(mu, sigma^2)
    Sx = np.hstack((X, S(X)))            # x-values and function values
    sortSx = Sx[Sx[:,1].argsort(),]      # sort rows by function value
    Elite = sortSx[0:Nel,:-1]            # elite samples (first Nel rows)
    mu = np.mean(Elite, axis=0)
    sigma = np.std(Elite, axis=0)
    print('S(mu)= {}, mu: {}, sigma: {}\n'.format(S(mu), mu, sigma))


S(mu)= [0.071], mu: [0.414], sigma: [0.922]
S(mu)= [0.063], mu: [0.81], sigma: [0.831]
S(mu)= [-0.033], mu: [1.212], sigma: [0.69]
S(mu)= [-0.588], mu: [1.447], sigma: [0.117]
S(mu)= [-0.958], mu: [1.366], sigma: [0.007]
S(mu)= [-0.958], mu: [1.366], sigma: [0.]
S(mu)= [-0.958], mu: [1.366], sigma: [3.535e-05]
S(mu)= [-0.958], mu: [1.366], sigma: [2.023e-06]







Figure 3.12: The normal pdfs of the first five sampling distributions, truncated to the
interval $[-2, 3]$. The initial sampling distribution is $\mathcal{N}(0, 3^2)$.


3.4.3 Splitting for Optimization


Minimizing a function $S(x)$, $x \in \mathcal{X}$, is closely related to drawing a random sample from a
level set of the form $\{x \in \mathcal{X} : S(x) \le \gamma\}$. Suppose $S$ has minimum value $\gamma^*$ attained at $x^*$.
As long as $\gamma \ge \gamma^*$, this level set contains the minimizer. Moreover, if $\gamma$ is close to $\gamma^*$, the
volume of this level set will be small. So, a randomly selected point from this set is expected
to be close to $x^*$. Thus, by gradually decreasing the level parameter $\gamma$, the level sets will
gradually shrink towards the set $\{x^*\}$. Indeed, the CE method was developed with exactly
this connection in mind; see, e.g., [102]. Note that the CE method employs a parametric
sampling distribution to obtain samples from the level sets (the elite samples). In [34]
a non-parametric sampling mechanism is introduced that uses an evolving collection of
particles. The resulting optimization algorithm, called splitting for continuous optimization
(SCO), provides a fast and accurate way to optimize complicated continuous functions. The
details of SCO are given in Algorithm 3.4.4.


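Algorithm 3.4.4: Splitting for Continuous Optimization (SCO)
input: Objective function $S$, sample size $N$, rarity parameter $\varrho$, scale factor $w$,
       bounded region $\mathcal{B} \subset \mathcal{X}$ that is known to contain a global minimizer, and
       maximum number of attempts MaxTry.
output: Final iteration number $t$ and sequence $(X_{\mathrm{best},1}, b_1), \ldots, (X_{\mathrm{best},t}, b_t)$ of best
        solutions and function values at each iteration.
1  Simulate $\mathcal{Y}_0 = \{Y_1, \ldots, Y_N\}$ uniformly on $\mathcal{B}$. Set $t \leftarrow 0$ and $N^{\mathrm{elite}} \leftarrow \lceil N\varrho \rceil$.
2  while stopping condition is not satisfied do
3      Determine the $N^{\mathrm{elite}}$ smallest values, $S_{(1)} \le \cdots \le S_{(N^{\mathrm{elite}})}$, of $\{S(X), X \in \mathcal{Y}_t\}$,
       and store the corresponding vectors, $X_{(1)}, \ldots, X_{(N^{\mathrm{elite}})}$, in $\mathcal{X}_{t+1}$. Set $b_{t+1} \leftarrow S_{(1)}$
       and $X_{\mathrm{best},t+1} \leftarrow X_{(1)}$.
4      Draw $B_i \sim \mathrm{Bernoulli}(1/2)$, $i = 1, \ldots, N^{\mathrm{elite}}$, with $\sum_{i=1}^{N^{\mathrm{elite}}} B_i = N \bmod N^{\mathrm{elite}}$.
5      for $i = 1$ to $N^{\mathrm{elite}}$ do
6          $R_i \leftarrow \lfloor N / N^{\mathrm{elite}} \rfloor + B_i$    // random splitting factor
7          $Y \leftarrow X_{(i)}$; $Y' \leftarrow Y$
8          for $j = 1$ to $R_i$ do
9              Draw $I \in \{1, \ldots, N^{\mathrm{elite}}\} \setminus \{i\}$ uniformly and let $\sigma_i \leftarrow w |X_{(i)} - X_{(I)}|$.
10             Simulate a uniform permutation $\pi = (\pi_1, \ldots, \pi_n)$ of $(1, \ldots, n)$.
11             for $k = 1$ to $n$ do
12                 for Try = 1 to MaxTry do
13                     $Y'(\pi_k) \leftarrow Y(\pi_k) + \sigma_i(\pi_k) Z$, $Z \sim \mathcal{N}(0, 1)$
14                     if $S(Y') < S(Y)$ then $Y \leftarrow Y'$ and break.
15             Add $Y$ to $\mathcal{Y}_{t+1}$
16     $t \leftarrow t + 1$
17 return $\{(X_{\mathrm{best},k}, b_k), k = 1, \ldots, t\}$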


At iteration $t = 0$, the algorithm starts with a population of particles $\mathcal{Y}_0 = \{Y_1, \ldots, Y_N\}$
that are uniformly generated on some bounded region $\mathcal{B}$, which is large enough to contain
a global minimizer. The function values of all particles in $\mathcal{Y}_0$ are sorted, and the best
$N^{\mathrm{elite}} = \lceil N\varrho \rceil$ form the elite particle set $\mathcal{X}_1$, exactly as in the CE method. Next, the elite
particles are “split” into $\lfloor N/N^{\mathrm{elite}} \rfloor$ children particles, adding one extra child to some of
the elite particles to ensure that the total number of children is again $N$. The purpose of
Line 4 is to randomize which elite particles receive an extra child. Lines 8–15 describe
how the children of the $i$-th elite particle are generated. First, in Line 9, we select one
of the other elite particles uniformly at random. The same line defines an $n$-dimensional
vector $\sigma_i$ whose components are the absolute differences between the vectors $X_{(i)}$ and $X_{(I)}$,
multiplied by a constant $w$. That is,

$$\sigma_i = w\,|X_{(i)} - X_{(I)}| := w \begin{bmatrix} |X_{(i),1} - X_{(I),1}| \\ |X_{(i),2} - X_{(I),2}| \\ \vdots \\ |X_{(i),n} - X_{(I),n}| \end{bmatrix}.$$

Next, a uniform random permutation $\pi$ of $(1, \ldots, n)$ is simulated (see Exercise 9). Lines
11–14 describe how, starting from a candidate child point $Y$, each coordinate of $Y$ is
resampled, in the order determined by $\pi$, by adding a standard normal random variable to
that component, multiplied by the corresponding component of $\sigma_i$ (Line 13). If the
resulting $Y'$ has a function value that is less than that of $Y$, then the new candidate is accepted.
Otherwise, the same coordinate is tried again. If no improvement is found in MaxTry
attempts, the original component is retained. This process is performed for all elite samples,
to produce the first-generation population $\mathcal{Y}_1$. The procedure is then repeated for iterations
$t = 1, 2, \ldots$, until some stopping criterion is met, e.g., when the best found function value
does not change for a number of consecutive iterations, or when the total number of
function evaluations exceeds some threshold. The best found function value and corresponding
argument (particle) are returned at the conclusion of the algorithm.

The input variable MaxTry governs how much computational time is dedicated to
updating a component. In most cases we have encountered, the choices $w = 0.5$ and MaxTry
= 5 work well. Empirically, relatively high values of $\varrho$ work well, such as $\varrho = 0.4$, $0.8$, or
even $\varrho = 1$. The latter case means that at each stage $t$ all samples from $\mathcal{Y}_{t-1}$ carry over to
the elite set $\mathcal{X}_t$.
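
The description above can be turned into a short program. The following is a rough sketch of Algorithm 3.4.4, not the implementation from [34]: the function name sco, the box-shaped region, the fixed number of iterations used as a stopping rule, and the shifted sphere test function are all assumptions made for illustration.

```python
import math
import numpy as np

def sco(S, lower, upper, N=100, rho=0.5, w=0.5, max_try=5, n_iter=50, seed=0):
    """Sketch of splitting for continuous optimization on the box [lower, upper]."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    n = lower.size
    n_elite = math.ceil(N * rho)
    Y = lower + (upper - lower) * rng.random((N, n))     # initial population, uniform on the box
    history = []
    for t in range(n_iter):
        vals = np.array([S(y) for y in Y])
        order = np.argsort(vals)[:n_elite]
        X, SX = Y[order], vals[order]                    # elite particles and their values
        history.append((X[0].copy(), SX[0]))             # best solution of this iteration
        # floor(N / n_elite) children per elite particle, one extra for N mod n_elite of them
        R = np.full(n_elite, N // n_elite)
        R[rng.permutation(n_elite)[: N % n_elite]] += 1
        children = []
        for i in range(n_elite):
            y, Sy = X[i].copy(), SX[i]
            for child in range(R[i]):
                I = rng.integers(n_elite - 1)
                I = I + 1 if I >= i else I               # uniform over the other elite particles
                sigma = w * np.abs(X[i] - X[I])
                for k in rng.permutation(n):             # coordinates in random order
                    for attempt in range(max_try):
                        y_new = y.copy()
                        y_new[k] = y[k] + sigma[k] * rng.standard_normal()
                        S_new = S(y_new)
                        if S_new < Sy:                   # accept only improvements
                            y, Sy = y_new, S_new
                            break
                children.append(y.copy())
        Y = np.array(children)
    return history

# Hypothetical test: shifted sphere function in three dimensions, minimizer (0.5, 0.5, 0.5)
S = lambda x: float(np.sum((x - 0.5) ** 2))
hist = sco(S, lower=[-5, -5, -5], upper=[5, 5, 5])
x_best, b = hist[-1]
print(x_best, b)
```

Because a child replaces its parent only when the function value improves, the recorded best value can never increase from one iteration to the next.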


Example 3.17 (Test Problem 112) Hock and Schittkowski [58] provide a rich source
of test problems for multiextremal optimization. A challenging one is Problem 112, where
the goal is to find $x$ so as to minimize the function

$$S(x) = \sum_{j=1}^{10} x_j \left( c_j + \ln \frac{x_j}{x_1 + \cdots + x_{10}} \right),$$

subject to the following set of constraints:

$$\begin{aligned} x_1 + 2x_2 + 2x_3 + x_6 + x_{10} - 2 &= 0, \\ x_4 + 2x_5 + x_6 + x_7 - 1 &= 0, \\ x_3 + x_7 + x_8 + 2x_9 + x_{10} - 1 &= 0, \\ x_j &\ge 0.000001, \quad j = 1, \ldots, 10, \end{aligned}$$

where the constants $\{c_j\}$ are given in Table 3.1.

Table 3.1: Constants for Test Problem 112.

$c_1 = -6.089$    $c_2 = -17.164$   $c_3 = -34.054$   $c_4 = -5.914$    $c_5 = -24.721$
$c_6 = -14.986$   $c_7 = -24.100$   $c_8 = -10.708$   $c_9 = -26.662$   $c_{10} = -22.179$


The best known minimal value in [58] was $-47.707579$. In [89] a better solution was
found, $-47.760765$, using a genetic algorithm. The corresponding solution vector was
completely different from the one in [58]. A further improvement, $-47.76109081$, was
found in [70], using the CE method, giving a similar solution vector to that in [89]:


0.04067247 0.14765159 0.78323637 0.00141368 0.48526222 
0.00069291 0.02736897 0.01794290 0.03729653 0.09685870 


To obtain a solution with SCO, we first converted this 10-dimensional problem into a
7-dimensional one by defining the objective function

$$\widetilde{S}(\boldsymbol{y}) = S(\boldsymbol{x}),$$

where $x_2 = y_1$, $x_3 = y_2$, $x_5 = y_3$, $x_6 = y_4$, $x_7 = y_5$, $x_9 = y_6$, $x_{10} = y_7$, and

$$\begin{aligned}
x_1 &= 2 - (2y_1 + 2y_2 + y_4 + y_7),\\
x_4 &= 1 - (2y_3 + y_4 + y_5),\\
x_8 &= 1 - (y_2 + y_5 + 2y_6 + y_7),
\end{aligned}$$

subject to $x_1,\ldots,x_{10} \geq 0.000001$, where the $\{x_i\}$ are taken as functions of the
$\{y_i\}$. We then adopted a penalty approach (see Section B.4) by adding a penalty function to
the original objective function:

$$\widehat{S}(\boldsymbol{y}) = S(\boldsymbol{x}) + 1000 \sum_{i=1}^{10} \max\{-(x_i - 0.000001),\, 0\},$$

where, again, the $\{x_i\}$ are defined in terms of the $\{y_i\}$ as above.
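
The reduction and penalty construction above can be checked numerically with a short Python sketch. The function names (`S`, `x_from_y`, `S_penalized`) are illustrative, and clipping the coordinates before taking logarithms is an assumption of this sketch (made so that infeasible points return a finite value); it is not part of the formulation in the text. The vector `x_rep` is the SCO solution reported in this example.

```python
import numpy as np

# Constants c_1, ..., c_10 from Table 3.1.
c = np.array([-6.089, -17.164, -34.054, -5.914, -24.721,
              -14.986, -24.100, -10.708, -26.662, -22.179])

def S(x):
    """Objective of Test Problem 112 for a strictly positive vector x."""
    return float(np.sum(x * (c + np.log(x / np.sum(x)))))

def x_from_y(y):
    """Map the 7 free variables y to the 10 original variables x."""
    y1, y2, y3, y4, y5, y6, y7 = y
    x1 = 2 - (2 * y1 + 2 * y2 + y4 + y7)
    x4 = 1 - (2 * y3 + y4 + y5)
    x8 = 1 - (y2 + y5 + 2 * y6 + y7)
    return np.array([x1, y1, y2, x4, y3, y4, y5, x8, y6, y7])

def S_penalized(y, eps=1e-6, weight=1000.0):
    """Objective plus penalty for violating x_i >= eps."""
    x = x_from_y(y)
    penalty = weight * np.sum(np.maximum(-(x - eps), 0.0))
    # Assumption of this sketch: clip x so the logarithm is always defined.
    return S(np.maximum(x, eps)) + penalty

# Solution vector reported in the text (x_1, ..., x_10).
x_rep = np.array([0.040668102417464, 0.147730393049955, 0.783153291185250,
                  0.001414221643059, 0.485246633088859, 0.000693172682617,
                  0.027399339496606, 0.017947274343948, 0.037314369272343,
                  0.096871356429511])
y_rep = x_rep[[1, 2, 4, 5, 6, 8, 9]]    # (x2, x3, x5, x6, x7, x9, x10)
value = S_penalized(y_rep)
print(value)    # close to the value -47.76109 reported in the text
```

At the reported solution all coordinates exceed 0.000001, so the penalty term vanishes and the penalized value coincides with the original objective.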
Optimizing this last function with SCO, we found, in less time than the other
algorithms, a slightly smaller function value: $-47.761090859365858$, with solution vector

0.040668102417464 0.147730393049955 0.783153291185250 0.001414221643059
0.485246633088859 0.000693172682617 0.027399339496606 0.017947274343948
0.037314369272343 0.096871356429511

in line with the earlier solutions.


3.4.4 Noisy Optimization 


In noisy optimization, the objective function is unknown, but estimates of function values
are available, e.g., via simulation. For example, to find an optimal prediction function
$g$ in supervised learning, the exact risk $\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(\boldsymbol{X}))$
is usually unknown and only estimates of the risk are available. Optimizing the risk is thus
typically a noisy optimization problem. Noisy optimization features prominently in simulation
studies where the behavior of some system (e.g., vehicles on a road network) is simulated under
certain parameters (e.g., the lengths of the traffic light intervals) and the aim is to choose
those parameters optimally (e.g., to maximize the traffic throughput). For each parameter
setting the exact value for the objective function is unknown but estimates can be obtained via
the simulation.

In general, suppose the goal is to minimize a function $S$, where $S$ is unknown, but
an estimate of $S(\boldsymbol{x})$ can be obtained for any choice of $\boldsymbol{x} \in \mathcal{X}$.
Because the gradient $\nabla S$ is unknown, one cannot directly apply classical optimization
methods. The stochastic approximation method mimics the classical gradient descent method by
replacing a deterministic gradient with an estimate $\widehat{\nabla S}(\boldsymbol{x})$.

A simple estimator for the $i$-th component of $\nabla S(\boldsymbol{x})$ (that is,
$\partial S(\boldsymbol{x})/\partial x_i$), is the central difference estimator

$$\frac{\widehat{S}(\boldsymbol{x} + \boldsymbol{e}_i\,\delta/2) - \widehat{S}(\boldsymbol{x} - \boldsymbol{e}_i\,\delta/2)}{\delta}, \tag{3.28}$$

where $\boldsymbol{e}_i$ denotes the $i$-th unit vector, and $\widehat{S}(\boldsymbol{x} + \boldsymbol{e}_i\,\delta/2)$
and $\widehat{S}(\boldsymbol{x} - \boldsymbol{e}_i\,\delta/2)$ can be any estimators of
$S(\boldsymbol{x} + \boldsymbol{e}_i\,\delta/2)$ and $S(\boldsymbol{x} - \boldsymbol{e}_i\,\delta/2)$,
respectively. The difference parameter $\delta > 0$ should be small enough to reduce the bias of
the estimator, but large enough to keep the variance of the estimator small.


To reduce the variance in the estimator (3.28) it is important to have
$\widehat{S}(\boldsymbol{x} + \boldsymbol{e}_i\,\delta/2)$ and
$\widehat{S}(\boldsymbol{x} - \boldsymbol{e}_i\,\delta/2)$ positively correlated. This can for
example be achieved by using common random numbers in the simulation.
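
The effect of common random numbers on the central difference estimator can be illustrated with a small Python sketch. The toy objective, the names `S_tilde` and `central_difference`, and all numerical settings are assumptions chosen for illustration: here $S(x) = \mathbb{E}(x - \xi)^2 = x^2 + 1$ with $\xi \sim \mathsf{N}(0,1)$, so the true derivative is $2x$.

```python
import numpy as np

def S_tilde(x, xi):
    """Toy noisy objective: E S_tilde(x, xi) = x**2 + 1 for xi ~ N(0, 1)."""
    return (x - xi) ** 2

def central_difference(x, delta, xi_plus, xi_minus):
    """Central difference estimate of S'(x) from one noisy value on each side."""
    return (S_tilde(x + delta / 2, xi_plus) - S_tilde(x - delta / 2, xi_minus)) / delta

rng = np.random.default_rng(1)
x, delta, reps = 1.0, 0.1, 10_000
xi_a = rng.standard_normal(reps)
xi_b = rng.standard_normal(reps)

grad_crn = central_difference(x, delta, xi_a, xi_a)   # same noise on both sides
grad_ind = central_difference(x, delta, xi_a, xi_b)   # independent noise

print("true derivative:", 2 * x)
print("common random numbers: mean %.3f, variance %.3f" % (grad_crn.mean(), grad_crn.var()))
print("independent noise:     mean %.3f, variance %.3f" % (grad_ind.mean(), grad_ind.var()))
```

In this toy case the common-random-number estimate simplifies to $2(x - \xi)$, whose variance does not depend on $\delta$, whereas the variance with independent noise grows like $1/\delta^2$.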
In direct analogy to gradient descent methods, the stochastic approximation method
produces a sequence of iterates, starting with some $\boldsymbol{x}_1 \in \mathcal{X}$, via

$$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \beta_t\, \widehat{\nabla S}(\boldsymbol{x}_t), \tag{3.29}$$

where $\beta_1, \beta_2, \ldots$ is a sequence of strictly positive step sizes. A generic
stochastic approximation algorithm for minimizing a function $S$ is thus as follows.


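Algorithm 3.4.5: Stochastic Approximation
input: A mechanism to estimate any gradient $\nabla S(\boldsymbol{x})$ and step sizes $\beta_1, \beta_2, \ldots$.
output: Approximate optimizer of $S$.
1 Initialize $\boldsymbol{x}_1 \in \mathcal{X}$. Set $t \leftarrow 1$.
2 while a stopping criterion is not met do
3     Obtain an estimated gradient $\widehat{\nabla S}(\boldsymbol{x}_t)$ of $S$ at $\boldsymbol{x}_t$.
4     Determine a step size $\beta_t$.
5     Set $\boldsymbol{x}_{t+1} \leftarrow \boldsymbol{x}_t - \beta_t\, \widehat{\nabla S}(\boldsymbol{x}_t)$.
6     $t \leftarrow t + 1$
7 return $\boldsymbol{x}_t$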
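
A minimal Python sketch of Algorithm 3.4.5 is given next. The function name `stochastic_approximation`, the fixed iteration budget used as stopping criterion, and the toy problem are assumptions for illustration only. The toy objective is $S(x) = \mathbb{E}(x - \xi)^2$ with $\xi \sim \mathsf{N}(3, 1)$, whose minimizer is $x^* = 3$ and for which $2(x - \xi)$ is an unbiased gradient estimate.

```python
import numpy as np

def stochastic_approximation(grad_estimate, x1, step_size, n_iter):
    """Iterate x_{t+1} = x_t - beta_t * (estimated gradient at x_t)."""
    x = np.asarray(x1, dtype=float)
    for t in range(1, n_iter + 1):       # stopping criterion: fixed budget (assumed)
        x = x - step_size(t) * grad_estimate(x)
    return x

rng = np.random.default_rng(2)

def grad_estimate(x):
    """Unbiased estimate of S'(x) = 2*(x - 3) from a single draw of xi."""
    xi = rng.normal(3.0, 1.0)
    return 2.0 * (x - xi)

x_hat = stochastic_approximation(grad_estimate, x1=0.0,
                                 step_size=lambda t: 0.5 / t, n_iter=5000)
print(x_hat)    # should lie close to the true minimizer 3
```

With the step sizes $\beta_t = 1/(2t)$ used here, each iterate is simply the running average of the simulated $\xi$ values, which makes the convergence to 3 easy to see in this special case.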


When $\widehat{\nabla S}(\boldsymbol{x}_t)$ is an unbiased estimator of $\nabla S(\boldsymbol{x}_t)$
in (3.29) the stochastic approximation Algorithm 3.4.5 is referred to as the Robbins–Monro
algorithm. When finite differences are used to estimate $\widehat{\nabla S}(\boldsymbol{x}_t)$, as
in (3.28), the resulting algorithm is known as the Kiefer–Wolfowitz algorithm. In Section 9.4.1
we will see how stochastic gradient descent is employed in deep learning to minimize the
training loss, based on a “minibatch” of training data.
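
As a hedged illustration of the finite-difference (Kiefer–Wolfowitz) variant, the following sketch estimates each gradient component with the central difference estimator (3.28), reusing the same simulated noise vector for all evaluations within a step. The toy objective $S(\boldsymbol{x}) = \mathbb{E}\|\boldsymbol{x} - \boldsymbol{\xi}\|^2$ with $\boldsymbol{\xi} \sim \mathsf{N}(\boldsymbol{m}, \mathbf{I})$, the helper names, and the parameter values are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
m = np.array([1.0, -2.0])      # minimizer of S(x) = E ||x - xi||^2, xi ~ N(m, I)

def S_tilde(x, xi):
    """Noisy function value for one realization xi."""
    return np.sum((x - xi) ** 2)

def fd_gradient(x, delta):
    """Central difference gradient estimate, one coordinate at a time."""
    xi = rng.normal(m, 1.0)    # one noise vector shared by all evaluations
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0             # i-th unit vector
        grad[i] = (S_tilde(x + e * delta / 2, xi)
                   - S_tilde(x - e * delta / 2, xi)) / delta
    return grad

x = np.zeros(2)
for t in range(1, 5001):
    x = x - (0.5 / t) * fd_gradient(x, delta=0.1)
print(x)    # should lie close to m
```

Only noisy function values are used here; no analytic gradient of $S$ is required.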

It can be shown [72] that, under certain regularity conditions on $S$, the sequence
$\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots$ converges to the true minimizer $\boldsymbol{x}^*$
when the step sizes decrease slowly enough to 0; in particular, when

$$\sum_{t=1}^{\infty} \beta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \beta_t^2 < \infty. \tag{3.30}$$

In practice, one rarely uses step sizes that satisfy (3.30), as the convergence of the
sequence will be too slow to be of practical use.
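
For a concrete feel for condition (3.30), the following sketch evaluates partial sums for the step sizes $\beta_t = 1/t$, a standard example that satisfies both requirements. The choice of $\beta_t$ and of the horizon $T$ is an assumption made purely for illustration.

```python
import numpy as np

T = 10**6
t = np.arange(1, T + 1)
beta = 1.0 / t                                   # candidate step sizes beta_t = 1/t

sum_beta = beta.sum()                            # keeps growing, roughly like ln(T)
sum_beta_sq = (beta ** 2).sum()                  # approaches pi^2 / 6

print("partial sum of beta_t:  ", sum_beta)
print("partial sum of beta_t^2:", sum_beta_sq, "limit:", np.pi ** 2 / 6)
```

The first sum diverges only logarithmically, which hints at why such step sizes lead to slow progress in practice.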





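An alternative approach to stochastic approximation is the stochastic counterpart
method, also called sample average approximation. It can be applied in situations where
the noisy objective function is of the form

$$S(\boldsymbol{x}) = \mathbb{E}\,\widetilde{S}(\boldsymbol{x}, \boldsymbol{\xi}), \quad \boldsymbol{x} \in \mathcal{X}, \tag{3.31}$$

where $\boldsymbol{\xi}$ is a random vector that can be simulated and
$\widetilde{S}(\boldsymbol{x}, \boldsymbol{\xi})$ can be evaluated exactly. The idea is to replace
the optimization of (3.31) with that of the sample average

$$\widehat{S}(\boldsymbol{x}) = \frac{1}{N} \sum_{i=1}^{N} \widetilde{S}(\boldsymbol{x}, \boldsymbol{\xi}_i), \quad \boldsymbol{x} \in \mathcal{X}, \tag{3.32}$$

where $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_N$ are iid copies of $\boldsymbol{\xi}$. Note
that $\widehat{S}$ is a deterministic function of $\boldsymbol{x}$ and so can be optimized using
any optimization algorithm. A solution to this sample average version is taken to be an
estimator of a solution $\boldsymbol{x}^*$ to the original problem (3.31).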
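
The sample average idea can be sketched in a few lines of Python. The toy objective $\widetilde{S}(x, \xi) = (x - \xi)^2$ with $\xi \sim \mathsf{N}(3, 1)$, the grid search, and the names `S_hat` and `x_hat` are assumptions for illustration; any deterministic optimizer could replace the grid search.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1000
xi = rng.normal(3.0, 1.0, size=N)      # iid copies xi_1, ..., xi_N, simulated once

def S_hat(x):
    """Sample average of S_tilde(x, xi) = (x - xi)**2 over the fixed sample."""
    return np.mean((x - xi) ** 2)

grid = np.linspace(0.0, 6.0, 601)      # deterministic grid search, spacing 0.01
values = np.array([S_hat(x) for x in grid])
x_hat = grid[np.argmin(values)]

print("grid minimizer:", x_hat)
print("sample mean:   ", xi.mean())    # exact minimizer of S_hat for this toy case
```

Because the sample is fixed before the optimization starts, repeated evaluations of `S_hat` at the same point return the same value, which is what makes the problem deterministic.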


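Example 3.18 (Determining Good Importance Sampling Parameters) The selection
of good importance sampling parameters can be viewed as a stochastic optimization problem.
Consider, for instance, the importance sampling estimator in Example 3.14. Recall
that the nominal distribution is the uniform distribution on the square $[-b, b]^2$, with pdf

$$f_b(\boldsymbol{x}) = \frac{1}{(2b)^2}, \quad \boldsymbol{x} \in [-b, b]^2,$$

where $b$ is large enough to ensure that $\mu_b$ is close to $\mu$; in that example, we chose
$b = 1000$. The importance sampling pdf is

$$g_\lambda(\boldsymbol{x}) = f_{R,\Theta}(r, \theta)\,\frac{1}{r} = \lambda e^{-\lambda r}\,\frac{1}{2\pi}\,\frac{1}{r}
= \frac{\lambda\, e^{-\lambda \sqrt{x_1^2 + x_2^2}}}{2\pi \sqrt{x_1^2 + x_2^2}}, \quad \boldsymbol{x} = (x_1, x_2) \in \mathbb{R}^2 \setminus \{\boldsymbol{0}\},$$

which depends on a free parameter $\lambda$. In the example we chose $\lambda = 0.1$. Is this the
best choice? Maybe $\lambda = 0.05$ or $0.2$ would have resulted in a more accurate estimate. The
important thing to realize is that the “effectiveness” of $\lambda$ can be measured in terms of
the variance of the estimator $\widehat{\mu}$ in (3.19), which is given by

$$\frac{1}{N} \operatorname{Var}_{g_\lambda}\!\left( H(\boldsymbol{X}) \frac{f(\boldsymbol{X})}{g_\lambda(\boldsymbol{X})} \right)
= \frac{1}{N}\, \mathbb{E}_{g_\lambda}\!\left[ H^2(\boldsymbol{X}) \frac{f^2(\boldsymbol{X})}{g_\lambda^2(\boldsymbol{X})} \right] - \frac{\mu^2}{N}
= \frac{1}{N}\, \mathbb{E}_{f}\!\left[ H^2(\boldsymbol{X}) \frac{f(\boldsymbol{X})}{g_\lambda(\boldsymbol{X})} \right] - \frac{\mu^2}{N}.$$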
Hence, the optimal parameter $\lambda^*$ minimizes the function
$S(\lambda) = \mathbb{E}_f[H^2(\boldsymbol{X}) f(\boldsymbol{X})/g_\lambda(\boldsymbol{X})]$,
which is unknown, but can be estimated from simulation. To solve this stochastic minimization
problem, we first use stochastic approximation. Thus, at each step of the algorithm,
the gradient of $S(\lambda)$ is estimated from realizations of
$\widehat{S}(\lambda) = H^2(\boldsymbol{X}) f(\boldsymbol{X})/g_\lambda(\boldsymbol{X})$, where
$\boldsymbol{X} \sim f_b$. As in the original problem (that is, the estimation of $\mu$), the
parameter $b$ should be large enough to avoid any bias in the estimator of $\lambda^*$, but also
small enough to ensure a small variance. The following Python code implements a particular
instance of Algorithm 3.4.5. For sampling from $f_b$ here, we used $b = 100$ instead of
$b = 1000$, as this will improve the crude Monte Carlo estimation of $\lambda^*$, without
noticeably affecting the bias. The gradient of $S(\lambda)$ is estimated in the body of the for
loop, using the central difference estimator (3.28). Notice how for the
$\widehat{S}(\lambda - \delta/2)$ and $\widehat{S}(\lambda + \delta/2)$ the same random vector
$\boldsymbol{X} = [X_1, X_2]^\top$ is used. This significantly reduces the variance of the
gradient estimator; see also Exercise 23. The step size $\beta_t$ should be such that
$\beta_t \widehat{\nabla S}(x_t) \approx \lambda_t$. Given the large gradient here, we choose
$\beta_0 = 10^{-7}$ and decrease it each step by a factor of 0.99. Figure 3.13 shows how the
sequence $\lambda_0, \lambda_1, \ldots$ decreases towards approximately 0.125, which we take as
an estimator for the optimal importance sampling parameter $\lambda^*$.


stochapprox.py

import numpy as np
from numpy import pi
import matplotlib.pyplot as plt

b = 100  # choose b large enough, but not too large
delta = 0.01
H = lambda x1, x2: (2*b)**2*np.exp(-np.sqrt(x1**2 + x2**2)/4)*(np.sin(2*np.sqrt(x1**2+x2**2)+1))*(x1**2+x2**2<b**2)
f = 1/(2*b)**2  # nominal (uniform) pdf on [-b,b]^2
g = lambda x1, x2, lam: lam*np.exp(-np.sqrt(x1**2+x2**2)*lam)/np.sqrt(x1**2+x2**2)/(2*pi)
beta = 10**-7  # step size very small, as the gradient is large
lam = 0.25
lams = np.array([lam])
N = 10**4
for i in range(200):
    x1 = -b + 2*b*np.random.rand(N,1)
    x2 = -b + 2*b*np.random.rand(N,1)
    lamL = lam - delta/2
    lamR = lam + delta/2
    estL = np.mean(H(x1,x2)**2*f/g(x1, x2, lamL))
    estR = np.mean(H(x1,x2)**2*f/g(x1, x2, lamR))  # use SAME x1,x2
    gr = (estR-estL)/delta  # central difference gradient estimate
    lam = lam - gr*beta     # gradient descent step
    lams = np.hstack((lams, lam))
    beta = beta*0.99        # decrease the step size

lamsize = range(0, (lams.size))
plt.plot(lamsize, lams)
plt.show()
Figure 3.13: The stochastic optimization algorithm produces a sequence $\lambda_t$,
$t = 0, 1, 2, \ldots$ (plotted against the number of steps) that tends to an approximate
estimate of the optimal importance sampling parameter $\lambda^* \approx 0.125$.


Next, we estimate $\lambda^*$ using a stochastic counterpart approach. As the objective function
$S(\lambda)$ is of the form (3.31) (with $\lambda$ taking the role of $\boldsymbol{x}$ and
$\boldsymbol{X}$ the role of $\boldsymbol{\xi}$), we obtain the sample average

$$\widehat{S}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} H^2(\boldsymbol{X}_i)\, \frac{f(\boldsymbol{X}_i)}{g_\lambda(\boldsymbol{X}_i)}, \tag{3.33}$$

where $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_N \sim_{\mathrm{iid}} f_b$. Once the
$\boldsymbol{X}_1, \ldots, \boldsymbol{X}_N$ have been simulated, $\widehat{S}(\lambda)$ is a
deterministic function of $\lambda$, which can be optimized by any means. We take the most basic
approach and simply evaluate the function for $\lambda = 0.01, 0.02, \ldots, 0.3$ and select the
minimizing $\lambda$ on this grid. The code is given below and Figure 3.14 shows
$\widehat{S}(\lambda)$ as a function of $\lambda$. The minimum value found was
$1.60 \cdot 10^4$ for minimizer $\widehat{\lambda}^* = 0.12$, which is in accordance with the
value obtained via stochastic approximation. The sensitivity of this estimate can be assessed
from the graph: for a wide range of values (say from 0.04 to 0.15) $\widehat{S}$ stays
rather flat. So any of these values could be used in an importance sampling procedure to
estimate $\mu$. However, very small values (less than 0.02) and large values (greater than 0.25)
should be avoided. Our original choice of $\lambda = 0.1$ was therefore justified and we could
not have done much better.


stochcounterpart.py

from stochapprox import *

lams = np.linspace(0.01, 0.31, 1000)
res = []
res = np.array(res)
for i in range(lams.size):
    lam = lams[i]
    np.random.seed(1)  # same random numbers for every lambda
    g = lambda x1, x2: lam*np.exp(-np.sqrt(x1**2+x2**2)*lam)/np.sqrt(x1**2+x2**2)/(2*pi)
    X = -b+2*b*np.random.rand(N,1)
    Y = -b+2*b*np.random.rand(N,1)
    Z = H(X,Y)**2*f/g(X,Y)
    estCMC = np.mean(Z)  # sample average (3.33)
    res = np.hstack((res, estCMC))

plt.plot(lams, res)
plt.xlabel(r'$\lambda$')
plt.ylabel(r'$\hat{S}(\lambda)$')
plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
plt.show()
Figure 3.14: The stochastic counterpart method replaces the unknown $S(\lambda)$ (that is, the
scaled variance of the importance sampling estimator) with its sample average,
$\widehat{S}(\lambda)$. The minimum value of $\widehat{S}$ is attained around $\lambda = 0.12$.


A third method for stochastic optimization is the cross-entropy method. In particular,
Algorithm 3.4.3 can easily be modified to minimize noisy functions
$S(\boldsymbol{x}) = \mathbb{E}\,\widetilde{S}(\boldsymbol{x}, \boldsymbol{\xi})$, as
defined in (3.31). The only change required in the algorithm is that every function value
$S(\boldsymbol{x})$ be replaced by its estimate $\widehat{S}(\boldsymbol{x})$. Depending on the
level of noise in the function, the sample size $N$ might have to be increased considerably.


Example 3.19 (Cross-Entropy Method for Noisy Optimization) To explore the use
of the CE method for noisy optimization, take the following noisy discrete optimization
problem. Suppose there is a “black box” that contains an unknown binary sequence of $n$
bits. If one feeds the black box any input vector, it will first scramble the input by
independently flipping the bits (changing 0 to 1 and 1 to 0) with a probability $\theta$ and then
return the number of bits that match the true (unknown) binary sequence. This is illustrated in
Figure 3.15 for $n = 10$.
input:      1001011101
scrambled:  1100010101
true:       1111100000
output:     4

Figure 3.15: A noisy optimization function as a black box. The input to the black box is a
binary vector. Inside the black box the digits of the input vector are scrambled by flipping
bits with probability $\theta$. The output is the number of bits of the scrambled vector that
match the true (unknown) binary vector.


Denoting by $S(\boldsymbol{x})$ the true number of matching digits for a binary input vector
$\boldsymbol{x}$, the black box thus returns a noisy estimate $\widehat{S}(\boldsymbol{x})$. The
objective is to estimate the binary sequence inside the black box, by feeding it with many input
vectors and observing their output. Or, to put it in a different way, to maximize
$S(\boldsymbol{x})$ using $\widehat{S}(\boldsymbol{x})$ as a proxy. Since there are $2^n$
possible input vectors, it is infeasible to try all possible vectors $\boldsymbol{x}$ even for
moderate $n$.

The following Python program implements the noisy function $\widehat{S}(\boldsymbol{x})$ for
$n = 100$. Each input bit is flipped with a rather high probability $\theta = 0.4$, so that the
output is a poor indicator of how many bits actually match the true vector. This true vector has
1s at positions $1, \ldots, 50$ and 0s at $51, \ldots, 100$.


Snoisy.py

import numpy as np

def Snoisy(X):  # takes a matrix (one input vector per row)
    n = X.shape[1]
    N = X.shape[0]
    # true binary vector
    xorg = np.hstack((np.ones((1,n//2)), np.zeros((1,n//2))))
    theta = 0.4  # probability to flip the input
    # storing the number of bits unequal to the true vector
    s = np.zeros(N)
    for i in range(0,N):
        # determine which bits to flip
        flip = (np.random.uniform(size=(n)) < theta).astype(int)
        ind = flip>0
        X[i][ind] = 1-X[i][ind]  # note: modifies X in place
        s[i] = (X[i] != xorg).sum()
    return s


The CE code below to optimize $S(\boldsymbol{x})$ is quite similar to the continuous
optimization code in Example 3.16. However, instead of sampling iid random variables
$X_1, \ldots, X_N$ from a normal distribution, we now sample iid binary vectors
$\boldsymbol{X}_1, \ldots, \boldsymbol{X}_N$ from a $\mathsf{Ber}(\boldsymbol{p})$ distribution.
More precisely, given a row vector of probabilities $\boldsymbol{p} = [p_1, \ldots, p_n]$, we
independently simulate the components $X_1, \ldots, X_n$ of each binary vector $\boldsymbol{X}$
according to $X_i \sim \mathsf{Ber}(p_i)$, $i = 1, \ldots, n$. After each iteration, the vector
$\boldsymbol{p}$ is updated as the (vector) mean of the elite samples. Note that, in contrast to
the minimization problem in Example 3.16, the elite samples now correspond to the largest
function values. The sample size is $N = 1000$ and the number of elite samples is 200. The
components of the initial sampling vector $\boldsymbol{p}$ are all equal to 1/2; that is, the
$\boldsymbol{X}$ are initially uniformly sampled from the set of all binary vectors of length
$n = 100$. At each subsequent iteration the parameter vector is updated via the mean of the
elite samples and evolves towards a degenerate vector $\boldsymbol{p}^*$ with only 1s and 0s.
Sampling from such a $\mathsf{Ber}(\boldsymbol{p}^*)$ distribution gives an outcome
$\boldsymbol{x}^* = \boldsymbol{p}^*$, which can be taken as an estimate for the maximizer of
$S$; that is, the true binary vector hidden in the black box. The algorithm stops when
$\boldsymbol{p}$ has degenerated sufficiently.

Figure 3.16 shows the evolution of the vector of probabilities $\boldsymbol{p}$. This figure may
be seen as the discrete analogue of Figure 3.12. We see that, despite the high noise, the CE
method is able to find the true state of the black box, and hence the maximum value of $S$.

Figure 3.16: Evolution of the vector of probabilities $\boldsymbol{p} = [p_1, \ldots, p_n]$
towards the degenerate solution.
CEnoisy.py

from Snoisy import Snoisy
import numpy as np
n = 100
rho = 0.1
N = 1000; Nel = int(N*rho); eps = 0.01
p = 0.5*np.ones(n)
i = 0
pstart = p
ps = np.zeros((1000,n))
ps[0] = pstart
pdist = np.zeros((1,1000))
while np.max(np.minimum(p,1-p)) > eps:
    i += 1
    X = (np.random.uniform(size=(N,n)) < p).astype(int)
    X_tmp = np.array(X, copy=True)   # Snoisy scrambles its argument in place
    SX = Snoisy(X_tmp)               # noisy number of mismatching bits
    ids = np.argsort(SX,axis=0)
    Elite = X[ids[0:Nel],:]          # fewest mismatches = most matches
    p = np.mean(Elite,axis=0)
    ps[i] = p
print(p)





Further Reading 


The article [68] explores why the Monte Carlo method is so important in today’s quantitat- 
ive investigations. The Handbook of Monte Carlo Methods [71] provides a comprehensive 
overview of Monte Carlo simulation that explores the latest topics, techniques, and real- 
world applications. Popular books on simulation and the Monte Carlo method include [42], 
[75], and [104]. A classic reference on random variable generation is [32]. Easy introduc- 
tions to stochastic simulation are given in [49], [98], and [100]. More advanced theory 
can be found in [5]. Markov chain Monte Carlo is detailed in [50] and [99]. The research 
monograph on the cross-entropy method is [103] and a tutorial is provided in [30]. A range 
of optimization applications of the CE method is given in [16]. Theoretical results on ad- 
aptive tuning schemes for simulated annealing may be found, for example, in [111]. There 
are several established ways for gradient estimation. These include the finite difference 
method, infinitesimal perturbation analysis, the score function method, and the method of 
weak derivatives; see, for example, [51, Chapter 7]. 


Exercises 


1. We can modify the Box–Muller method in Example 3.1 to draw $X$ and $Y$ uniformly
on the unit disc, $\{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \leq 1\}$, in the following way:
Independently draw a radius $R$ and an angle $\Theta \sim \mathsf{U}(0, 2\pi)$, and return
$X = R\cos(\Theta)$, $Y = R\sin(\Theta)$. The question is how to draw $R$.

(a) Show that the cdf of $R$ is given by $F_R(r) = r^2$ for $0 \leq r \leq 1$ (with
$F_R(r) = 0$ and $F_R(r) = 1$ for $r < 0$ and $r > 1$, respectively).
(b) Explain how to simulate $R$ using the inverse-transform method.

(c) Simulate 100 independent draws of $[X, Y]^\top$ according to the method described
above.


2. A simple acceptance-rejection method to simulate a vector $\boldsymbol{X}$ in the unit
$d$-ball $\{\boldsymbol{x} \in \mathbb{R}^d : \|\boldsymbol{x}\| \leq 1\}$ is to first generate
$\boldsymbol{X}$ uniformly in the hypercube $[-1, 1]^d$ and then to accept the point only if
$\|\boldsymbol{X}\| \leq 1$. Determine an analytic expression for the probability of acceptance
as a function of $d$ and plot this for $d = 1, \ldots, 50$.


3. Let the random variable $X$ have pdf

$$f(x) = \begin{cases} \frac{1}{2}\,x, & 0 \leq x < 1,\\[2pt] \frac{1}{2}, & 1 \leq x \leq \frac{5}{2}. \end{cases}$$

Simulate a random variable from $f(x)$, using

(a) the inverse-transform method;
(b) the acceptance-rejection method, using the proposal density

$$g(x) = \frac{8}{25}\,x, \quad 0 \leq x \leq \frac{5}{2}.$$


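4. Construct simulation algorithms for the following distributions:

(a) The $\mathsf{Weib}(\alpha, \lambda)$ distribution, with cdf
$F(x) = 1 - e^{-(\lambda x)^\alpha}$, $x \geq 0$, where $\lambda > 0$ and $\alpha > 0$.

(b) The $\mathsf{Pareto}(\alpha, \lambda)$ distribution, with pdf
$f(x) = \alpha\lambda(1 + \lambda x)^{-(\alpha+1)}$, $x \geq 0$, where $\lambda > 0$ and
$\alpha > 0$.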


5. We wish to sample from the pdf

$$f(x) = x e^{-x}, \quad x \geq 0,$$

using acceptance-rejection with the proposal pdf $g(x) = e^{-x/2}/2$, $x \geq 0$.

(a) Find the smallest $C$ for which $C g(x) \geq f(x)$ for all $x$.

(b) What is the efficiency of this acceptance-rejection method?


6. Let $[X, Y]^\top$ be uniformly distributed on the triangle with corners $(0,0)$, $(1,2)$, and
$(-1,1)$. Give the distribution of $[U, V]^\top$ defined by the linear transformation


MB alt 


7. Explain how to generate a random variable from the extreme value distribution,
which has cdf

$$F(x) = 1 - e^{-\exp\left(\frac{x - \mu}{\sigma}\right)}, \quad -\infty < x < \infty, \quad (\sigma > 0),$$

via the inverse-transform method.
8. Write a program that generates and displays 100 random vectors that are uniformly
distributed within the ellipse

$$5x^2 + 21xy + 25y^2 = 9.$$

[Hint: Consider generating uniformly distributed samples within the circle of radius
3 and use the fact that linear transformations preserve uniformity to transform the
circle to the given ellipse.]


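9. Suppose that $X_i \sim \mathsf{Exp}(\lambda_i)$, independently, for all $i = 1, \ldots, n$.
Let $\boldsymbol{\Pi} = [\Pi_1, \ldots, \Pi_n]^\top$ be the random permutation induced by the
ordering $X_{\Pi_1} < X_{\Pi_2} < \cdots < X_{\Pi_n}$, and define $Z_1 := X_{\Pi_1}$ and
$Z_j := X_{\Pi_j} - X_{\Pi_{j-1}}$, for $j = 2, \ldots, n$.

(a) Determine an $n \times n$ matrix $\mathbf{A}$ such that $\boldsymbol{Z} = \mathbf{A}\boldsymbol{X}$
and show that $\det(\mathbf{A}) = 1$.
(b) Denote the joint pdf of $\boldsymbol{X}$ and $\boldsymbol{\Pi}$ as

$$f_{\boldsymbol{X},\boldsymbol{\Pi}}(\boldsymbol{x}, \boldsymbol{\pi}) = \prod_{i=1}^{n} \lambda_{\pi_i} \exp\left(-\lambda_{\pi_i} x_{\pi_i}\right) \times \mathbb{1}\{x_{\pi_1} < \cdots < x_{\pi_n}\}, \quad \boldsymbol{x} \geq \boldsymbol{0},\ \boldsymbol{\pi} \in \mathcal{P}_n,$$

where $\mathcal{P}_n$ is the set of all $n!$ permutations of $\{1, \ldots, n\}$. Use the
multivariate transformation formula (C.22) to show that

$$f_{\boldsymbol{Z},\boldsymbol{\Pi}}(\boldsymbol{z}, \boldsymbol{\pi}) = \exp\left(-\sum_{i=1}^{n} z_i \sum_{k \geq i} \lambda_{\pi_k}\right) \prod_{i=1}^{n} \lambda_i, \quad \boldsymbol{z} \geq \boldsymbol{0},\ \boldsymbol{\pi} \in \mathcal{P}_n.$$

Hence, conclude that the probability mass function of the random permutation
$\boldsymbol{\Pi}$ is:

$$\mathbb{P}[\boldsymbol{\Pi} = \boldsymbol{\pi}] = \prod_{i=1}^{n} \frac{\lambda_{\pi_i}}{\sum_{k \geq i} \lambda_{\pi_k}}, \quad \boldsymbol{\pi} \in \mathcal{P}_n.$$

(c) Write pseudo-code to simulate a uniform random permutation
$\boldsymbol{\Pi} \in \mathcal{P}_n$; that is, such that
$\mathbb{P}[\boldsymbol{\Pi} = \boldsymbol{\pi}] = \frac{1}{n!}$, and explain how this uniform
random permutation can be used to reshuffle a training set $\tau_n$.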


10. Consider the Markov chain with transition graph given in Figure 3.17, starting in
state 1.

Figure 3.17: The transition graph for the Markov chain $\{X_t, t = 0, 1, 2, \ldots\}$.

(a) Construct a computer program to simulate the Markov chain, and show a realization
for $N = 100$ steps.


(b) Compute the limiting probabilities that the Markov chain is in state 1,2,...,6, 
by solving the global balance equations (C.42). 


(c) Verify that the exact limiting probabilities correspond to the average fraction 
of times that the Markov process visits states 1,2,...,6, for a large number of 
steps N. 


11. As a generalization of Example C.9, consider a random walk on an arbitrary undirected
connected graph with a finite vertex set $V$. For any vertex $v \in V$, let $d(v)$ be
the number of neighbors of $v$, called the degree of $v$. The random walk can jump to
each one of the neighbors with probability $1/d(v)$ and can be described by a Markov
chain. Show that, if the chain is aperiodic, the limiting probability that the chain is
in state $v$ is equal to $d(v)/\sum_{v' \in V} d(v')$.


12. Let $U, V \sim_{\mathrm{iid}} \mathsf{U}(0, 1)$. The reason why in Example 3.7 the sample
mean and sample median behave very differently is that $\mathbb{E}[U/V] = \infty$, while the
median of $U/V$ is finite. Show this, and compute the median. [Hint: start by determining the
cdf of $Z = U/V$ by writing it as an expectation of an indicator function.]


13. Consider the problem of generating samples from $Y \sim \mathsf{Gamma}(2, 10)$.

(a) Direct simulation: Let $U_1, U_2 \sim_{\mathrm{iid}} \mathsf{U}(0, 1)$. Show that
$-\ln(U_1)/10 - \ln(U_2)/10 \sim \mathsf{Gamma}(2, 10)$. [Hint: derive the distribution of
$-\ln(U_1)/10$ and use Example C.1.]

(b) Simulation via MCMC: Implement an independence sampler to simulate from
the $\mathsf{Gamma}(2, 10)$ target pdf

$$f(x) = 100\, x\, e^{-10x}, \quad x \geq 0,$$

using proposal transition density $q(y \mid x) = g(y)$, where $g(y)$ is the pdf of an
$\mathsf{Exp}(5)$ random variable. Generate $N = 500$ samples, and compare the true cdf
with the empirical cdf of the data.


14. Let $\boldsymbol{X} = [X, Y]^\top$ be a random column vector with a bivariate normal
distribution with expectation vector $\boldsymbol{\mu} = [1, 2]^\top$ and covariance matrix


zeja al 


(a) What are the conditional distributions of (Y | X = x) and (X| Y = y)? [Hint: use 
Theorem C.8.] 


(b) Implement a Gibbs sampler to draw 10° samples from the bivariate distribution 
N(u, Ł) for a = 0, 1, and 1.75, and plot the resulting samples. 


15. Here the objective is to sample from the 2-dimensional pdf 
fay) = cet, x>0, y>0, 


for some normalization constant c, using a Gibbs sampler. Let (X, Y) ~ f. 


(a) Find the conditional pdf of $X$ given $Y = y$, and the conditional pdf of $Y$ given
$X = x$.


(b) Write working Python code that implements the Gibbs sampler and outputs 
1000 points that are approximately distributed according to f. 
(c) Describe how the normalization constant $c$ could be estimated via Monte Carlo
simulation, using random variables
$X_1, \ldots, X_N, Y_1, \ldots, Y_N \sim_{\mathrm{iid}} \mathsf{Exp}(1)$.

16. We wish to estimate $\mu = \int_{-2}^{2} e^{-x^2/2}\, dx = \int H(x) f(x)\, dx$ via Monte
Carlo simulation using two different approaches: (1) defining $H(x) = 4 e^{-x^2/2}$ and $f$ the
pdf of the $\mathsf{U}[-2, 2]$ distribution and (2) defining
$H(x) = \sqrt{2\pi}\, \mathbb{1}\{-2 \leq x \leq 2\}$ and $f$ the pdf of the $\mathsf{N}(0, 1)$
distribution.

(a) For both cases estimate $\mu$ via the estimator

$$\widehat{\mu} = N^{-1} \sum_{i=1}^{N} H(X_i). \tag{3.34}$$

Use a sample size of $N = 1000$.
(b) For both cases estimate the relative error $\kappa$ of $\widehat{\mu}$ using $N = 100$.
(c) Give a 95% confidence interval for $\mu$ for both cases using $N = 100$.

(d) From part (b), assess how large $N$ should be such that the relative width of the
confidence interval is less than 0.01, and carry out the simulation with this $N$.
Compare the result with the true value of $\mu$.


17. Consider estimation of the tail probability $\mu = \mathbb{P}[X \geq \gamma]$ of some random
variable $X$, where $\gamma$ is large. The crude Monte Carlo estimator of $\mu$ is

$$\widehat{\mu} = \frac{1}{N} \sum_{i=1}^{N} Z_i, \tag{3.35}$$

where $X_1, \ldots, X_N$ are iid copies of $X$ and $Z_i = \mathbb{1}\{X_i \geq \gamma\}$,
$i = 1, \ldots, N$.
(a) Show that $\widehat{\mu}$ is unbiased; that is, $\mathbb{E}\,\widehat{\mu} = \mu$.
(b) Express the relative error of $\widehat{\mu}$, i.e.,

$$\mathrm{RE} = \frac{\sqrt{\operatorname{Var} \widehat{\mu}}}{\mathbb{E}\,\widehat{\mu}},$$

in terms of $N$ and $\mu$.

(c) Explain how to estimate the relative error of $\widehat{\mu}$ from outcomes
$x_1, \ldots, x_N$ of $X_1, \ldots, X_N$, and how to construct a 95% confidence interval for
$\mu$.

(d) An unbiased estimator $Z$ of $\mu$ is said to be logarithmically efficient if

$$\lim_{\gamma \to \infty} \frac{\ln \mathbb{E} Z^2}{\ln \mu^2} = 1. \tag{3.36}$$

Show that the CMC estimator (3.35) with $N = 1$ is not logarithmically efficient.
18. One of the test cases in [70] involves the minimization of the Hougen function.
Implement a cross-entropy and a simulated annealing algorithm to carry out this
optimization task.


19. In the binary knapsack problem, the goal is to solve the optimization problem:

$$\max_{\boldsymbol{x} \in \{0,1\}^n} \boldsymbol{p}^\top \boldsymbol{x},$$

subject to the constraints

$$\mathbf{A}\boldsymbol{x} \leq \boldsymbol{c},$$

where $\boldsymbol{p}$ and $\boldsymbol{w}$ are $n \times 1$ vectors of non-negative numbers,
$\mathbf{A} = (a_{ij})$ is an $m \times n$ matrix, and $\boldsymbol{c}$ is an $m \times 1$
vector. The interpretation is that $x_j = 1$ or $0$ depending on whether item $j$ with value
$p_j$ is packed into the knapsack or not, $j = 1, \ldots, n$. The variable $a_{ij}$ represents
the $i$-th attribute (e.g., volume, weight) of the $j$-th item. Associated with each attribute
is a maximal capacity, e.g., $c_1$ could be the maximum volume of the knapsack, $c_2$ the
maximum weight, etc.

Write a CE program to solve the Sento1.dat knapsack problem at
http://people.brunel.ac.uk/~mastjjb/jeb/orlib/files/mknap2.txt, as described in [16].


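20. Let $(C_1, R_1), (C_2, R_2), \ldots$ be a renewal reward process, with
$\mathbb{E} R_1 < \infty$ and $\mathbb{E} C_1 < \infty$. Let
$A_t = \sum_{i=1}^{N_t} R_i / t$ be the average reward at time $t = 1, 2, \ldots$, where
$N_t = \max\{n : T_n \leq t\}$ and we have defined $T_n = \sum_{i=1}^{n} C_i$ as the time of
the $n$-th renewal.

(a) Show that $T_n/n \to \mathbb{E} C_1$ as $n \to \infty$.

(b) Show that $N_t \to \infty$ as $t \to \infty$.

(c) Show that $N_t/t \to 1/\mathbb{E} C_1$ as $t \to \infty$. [Hint: Use the fact that
$T_{N_t} \leq t \leq T_{N_t + 1}$ for all $t = 1, 2, \ldots$.]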


(d) Show that 


21. Prove Theorem 3.3.


22. Prove that if $H(\boldsymbol{x}) \geq 0$ the importance sampling pdf $g^*$ in (3.22) gives
the zero-variance importance sampling estimator $\widehat{\mu} = \mu$.


23. Let $X$ and $Y$ be random variables (not necessarily independent) and suppose we wish
to estimate the expected difference $\mu = \mathbb{E}[X - Y] = \mathbb{E}X - \mathbb{E}Y$.

(a) Show that if $X$ and $Y$ are positively correlated, the variance of $X - Y$ is smaller
than if $X$ and $Y$ are independent.

(b) Suppose now that $X$ and $Y$ have cdfs $F$ and $G$, respectively, and are
simulated via the inverse-transform method: $X = F^{-1}(U)$, $Y = G^{-1}(V)$, with
$U, V \sim \mathsf{U}(0, 1)$, not necessarily independent. Intuitively, one might expect that
if $U$ and $V$ are positively correlated, the variance of $X - Y$ would be smaller than
if $U$ and $V$ are independent. Show that this is not always the case by providing
a counter-example.

(c) Continuing (b), assume now that $F$ and $G$ are continuous. Show that the variance
of $X - Y$ by taking common random numbers $U = V$ is no larger than
when $U$ and $V$ are independent. [Hint: Use the following lemma of Hoeffding
[41]: If $(X, Y)$ have joint cdf $H$ with marginal cdfs of $X$ and $Y$ being $F$ and $G$,
respectively, then

$$\operatorname{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( H(x, y) - F(x)\, G(y) \right) dx\, dy,$$

provided $\operatorname{Cov}(X, Y)$ exists.]
CHAPTER 4

UNSUPERVISED LEARNING

When there is no distinction between response and explanatory variables, unsu- 
pervised methods are required to learn the structure of the data. In this chapter we 
look at various unsupervised learning techniques, such as density estimation, cluster- 
ing, and principal component analysis. Important tools in unsupervised learning in- 
clude the cross-entropy training loss, mixture models, the Expectation—Maximization 
algorithm, and the Singular Value Decomposition. 


4.1 Introduction 


In contrast to supervised learning, where an “output” (response) variable $y$ is explained by
an “input” (explanatory) vector $\boldsymbol{x}$, in unsupervised learning there is no response
variable and the overall goal is to extract useful information and patterns from the data, e.g.,
in the form $\tau = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ or as a matrix
$\mathbf{X}^\top = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n]$. In essence, the objective of
unsupervised learning is to learn about the underlying probability distribution of the data.

We start in Section 4.2 by setting up a framework for unsupervised learning that is 
similar to the framework used for supervised learning in Section 2.3. That is, we formulate 
unsupervised learning in terms of risk and loss minimization; but now involving the cross- 
entropy risk, rather than the squared-error risk. In a natural way this leads to fundamental 
learning concepts such as likelihood, Fisher information, and the Akaike information cri- 
terion. Section 4.3 introduces the Expectation—Maximization (EM) algorithm as a useful 
method for maximizing likelihood functions when their solution cannot be found easily in 
closed form. 

If the data forms an iid sample from some unknown distribution, the “empirical dis- 
tribution” of the data provides valuable information about the unknown distribution. In 
Section 4.4 we formalize the concept of the empirical distribution (a generalization of the 
empirical cdf) and explain how we can produce an estimate of the underlying probability 
density function of the data using kernel density estimators. 

Most unsupervised learning techniques focus on identifying certain traits of the under- 
lying distribution, such as its local maximizers. A related idea is to partition the data into 
clusters of points that are in some sense “similar” to each other. In Section 4.5 we formu- 
late the clustering problem in terms of a mixture model. In particular, the data are assumed 
to come from a mixture of (usually Gaussian) distributions, and the objective is to recover
the parameters of the mixture distributions from the data. The principal tool for parameter 
estimation in mixture models is the EM algorithm. 

Section 4.6 discusses a more heuristic approach to clustering, where the data are 
grouped according to certain “cluster centers”, whose positions are found by solving an 
optimization problem. Section 4.7 describes how clusters can be constructed in a hierarch- 
ical manner. 

Finally, in Section 4.8 we discuss the unsupervised learning technique called Principal 
Component Analysis (PCA), which is an important tool for reducing the dimensionality of 
the data. 

We will revisit various unsupervised learning techniques in subsequent chapters on su- 
pervised learning. For example, cross-entropy training loss minimization will be important 
in logistic regression (Section 5.7) and classification (Chapter 7), and PCA can be used 
for variable selection and dimensionality reduction, to make models easier to train and 
increase their predictive power; see e.g., Sections 6.8 and 7.4. 


4.2 Risk and Loss in Unsupervised Learning 


In unsupervised learning, the training data $\mathcal{T} := \{X_1,\ldots,X_n\}$ consists only of (what are 
usually assumed to be) independent copies of a feature vector $X$; there is no response 
data. Suppose our objective is to learn the unknown pdf $f$ of $X$ based on an outcome 
$\tau = \{x_1,\ldots,x_n\}$ of the training data $\mathcal{T}$. Conveniently, we can follow the same line of 
reasoning as for supervised learning, discussed in Sections 2.3–2.5. Table 4.1 gives a summary 
of definitions for the case of unsupervised learning. Compare this with Table 2.1 for the 
supervised case. 

Similar to supervised learning, we wish to find a function $g$, which is now a probability 
density (continuous or discrete), that best approximates the pdf $f$ in terms of minimizing a 
risk 

$$\ell(g) := \mathbb{E}\,\mathrm{Loss}(f(X), g(X)), \tag{4.1}$$

where Loss is a loss function. In (2.27), we already encountered the Kullback–Leibler risk 

$$\ell(g) := \mathbb{E}\ln\frac{f(X)}{g(X)} = \mathbb{E}\ln f(X) - \mathbb{E}\ln g(X). \tag{4.2}$$

If $\mathcal{G}$ is a class of functions that contains $f$, then minimizing the Kullback–Leibler risk over 
$\mathcal{G}$ will yield the (correct) minimizer $f$. Of course, the problem is that minimization of (4.2) 
depends on $f$, which is generally not known. However, since the term $\mathbb{E}\ln f(X)$ does not 
depend on $g$, it plays no role in the minimization of the Kullback–Leibler risk. By removing 
this term, we obtain the cross-entropy risk (for discrete $X$ replace the integral with a sum): 

$$\ell(g) := -\mathbb{E}\ln g(X) = -\int f(x)\ln g(x)\,\mathrm{d}x. \tag{4.3}$$


Table 4.1: Summary of definitions for unsupervised learning. 

$x$                                Fixed feature vector. 
$X$                                Random feature vector. 
$f(x)$                             Pdf of $X$ evaluated at the point $x$. 
$\tau$ or $\tau_n$                 Fixed training data $\{x_i, i = 1,\ldots,n\}$. 
$\mathcal{T}$ or $\mathcal{T}_n$   Random training data $\{X_i, i = 1,\ldots,n\}$. 
$g$                                Approximation of the pdf $f$. 
$\mathrm{Loss}(f(x), g(x))$        Loss incurred when approximating $f(x)$ with $g(x)$. 
$\ell(g)$                          Risk for approximation function $g$; that is, $\mathbb{E}\,\mathrm{Loss}(f(X), g(X))$. 
$g^{\mathcal{G}}$                  Optimal approximation function in function class $\mathcal{G}$; that is, 
                                   $\mathrm{argmin}_{g\in\mathcal{G}}\,\ell(g)$. 
$\ell_\tau(g)$                     Training loss for approximation function (guess) $g$; that is, 
                                   the sample average estimate of $\ell(g)$ based on a fixed training 
                                   sample $\tau$. 
$\ell_{\mathcal{T}}(g)$            The same as $\ell_\tau(g)$, but now for a random training sample $\mathcal{T}$. 
$g_\tau^{\mathcal{G}}$ or $g_\tau$ The learner: $\mathrm{argmin}_{g\in\mathcal{G}}\,\ell_\tau(g)$. That is, the optimal 
                                   approximation function based on a fixed training set $\tau$ and function 
                                   class $\mathcal{G}$. We suppress the superscript $\mathcal{G}$ if the function class is 
                                   implicit. 
$g_{\mathcal{T}}^{\mathcal{G}}$ or $g_{\mathcal{T}}$   The learner for a random training set $\mathcal{T}$. 

Thus, minimizing the cross-entropy risk (4.3) over all $g \in \mathcal{G}$ again gives the minimizer 
$f$, provided that $f \in \mathcal{G}$. Unfortunately, solving (4.3) is also infeasible in general, as it still 
depends on $f$. Instead, we seek to minimize the cross-entropy training loss: 

$$\ell_\tau(g) := \frac{1}{n}\sum_{i=1}^n \mathrm{Loss}(f(x_i), g(x_i)) = -\frac{1}{n}\sum_{i=1}^n \ln g(x_i) \tag{4.4}$$


over the class of functions $\mathcal{G}$, where $\tau = \{x_1,\ldots,x_n\}$ is an iid sample from $f$. This 
optimization is doable without knowing $f$ and is equivalent to solving the maximization problem 

$$\max_{g\in\mathcal{G}} \sum_{i=1}^n \ln g(x_i). \tag{4.5}$$

A key step in setting up the learning procedure is to select a suitable function class $\mathcal{G}$ over 
which to optimize. The standard approach is to parameterize $g$ with a parameter $\theta$ and let 
$\mathcal{G}$ be the class of functions $\{g(\cdot\mid\theta), \theta \in \Theta\}$ for some $p$-dimensional parameter set $\Theta$. For the 
remainder of Section 4.2, we will be using this function class, as well as the cross-entropy 
risk. 


The function $\theta \mapsto g(x\mid\theta)$ is called the likelihood function. It gives the likelihood of 
the observed feature vector $x$ under $g(\cdot\mid\theta)$, as a function of the parameter $\theta$. The natural 
logarithm of the likelihood function is called the log-likelihood function and its gradient 
with respect to $\theta$ is called the score function, denoted $S(x\mid\theta)$; that is, 

$$S(x\mid\theta) := \frac{\partial \ln g(x\mid\theta)}{\partial\theta} = \frac{\frac{\partial g(x\mid\theta)}{\partial\theta}}{g(x\mid\theta)}. \tag{4.6}$$
The random score $S(X\mid\theta)$, with $X \sim g(\cdot\mid\theta)$, is of particular interest. In many cases, its 
expectation is equal to the zero vector; namely, 

$$\mathbb{E}_\theta S(X\mid\theta) = \int \frac{\frac{\partial g(x\mid\theta)}{\partial\theta}}{g(x\mid\theta)}\,g(x\mid\theta)\,\mathrm{d}x = \int \frac{\partial g(x\mid\theta)}{\partial\theta}\,\mathrm{d}x = \frac{\partial}{\partial\theta}\int g(x\mid\theta)\,\mathrm{d}x = \frac{\partial 1}{\partial\theta} = 0, \tag{4.7}$$

provided that the interchange of differentiation and integration is justified. This is true for 
a large number of distributions, including the normal, exponential, and binomial distributions. 
Notable exceptions are distributions whose support depends on the distributional 
parameter; for example the $\mathsf{U}(0, \theta)$ distribution. 

It is important to see whether expectations are taken with respect to $X \sim g(\cdot\mid\theta)$ or 
$X \sim f$. We use the expectation symbols $\mathbb{E}_\theta$ and $\mathbb{E}$ to distinguish the two cases. 





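From now on we simply assume that the interchange of differentiation and integration 
is permitted; see, e.g., [76] for sufficient conditions. The covariance matrix of the random 
score $S(X\mid\theta)$ is called the Fisher information matrix, which we denote by $F$ or $F(\theta)$ to 
show its dependence on $\theta$. Since the expected score is $0$, we have 

$$F(\theta) = \mathbb{E}_\theta\left[S(X\mid\theta)\,S(X\mid\theta)^\top\right]. \tag{4.8}$$

A related matrix is the expected Hessian matrix of $-\ln g(X\mid\theta)$: 

$$H(\theta) := -\mathbb{E}\,\frac{\partial S(X\mid\theta)}{\partial\theta} = -\mathbb{E}\begin{bmatrix}
\frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_1^2} & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_1\partial\theta_p} \\
\frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_2^2} & \cdots & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_2\partial\theta_p} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_p\partial\theta_2} & \cdots & \frac{\partial^2 \ln g(X\mid\theta)}{\partial\theta_p^2}
\end{bmatrix}. \tag{4.9}$$

Note that the expectation here is with respect to $X \sim f$. It turns out that if $f = g(\cdot\mid\theta)$, the 
two matrices are the same; that is, 

$$F(\theta) = H(\theta), \tag{4.10}$$

provided that we may swap the order of differentiation and integration (expectation). This 
result is called the information matrix equality. We leave the proof as Exercise 1. 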
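As a quick numerical sanity check of (4.7) and (4.10), the following sketch estimates the expected score, the Fisher information matrix, and the expected Hessian by Monte Carlo. The model (a normal density parameterized by its mean and variance), the parameter values, and the sample size are illustrative assumptions, not taken from the text. 

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 4.0                      # illustrative theta = (mu, sigma^2)
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)   # X ~ g(. | theta)
d = x - mu

# Score of ln g(x | mu, sigma^2) with respect to (mu, sigma^2), one row per sample.
score = np.column_stack([d / sigma2,
                         -0.5 / sigma2 + d**2 / (2 * sigma2**2)])

# Hessian of -ln g(x | mu, sigma^2) for every sample, shape (n, 2, 2).
hess = np.empty((x.size, 2, 2))
hess[:, 0, 0] = 1 / sigma2
hess[:, 0, 1] = hess[:, 1, 0] = d / sigma2**2
hess[:, 1, 1] = -0.5 / sigma2**2 + d**2 / sigma2**3

mean_score = score.mean(axis=0)            # Monte Carlo estimate of E_theta S(X | theta)
F_hat = score.T @ score / x.size           # estimate of F(theta) in (4.8)
H_hat = hess.mean(axis=0)                  # estimate of H(theta) in (4.9)

print(mean_score)
print(F_hat)
print(H_hat)
```

For this model both matrices should be close to the diagonal matrix with entries $1/\sigma^2$ and $1/(2\sigma^4)$, up to Monte Carlo error. 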


The matrices $F(\theta)$ and $H(\theta)$ play important roles in approximating the cross-entropy 
risk for large $n$. To set the scene, let $g^{\mathcal{G}} = g(\cdot\mid\theta^*)$ be the minimizer of the cross-entropy 
risk 

$$r(\theta) := -\mathbb{E}\ln g(X\mid\theta).$$

We assume that $r$, as a function of $\theta$, is well-behaved; in particular, that in the neighborhood 
of $\theta^*$ it is strictly convex and twice continuously differentiable (this holds true, for example, 
if $g$ is a Gaussian density). It follows that $\theta^*$ is a root of $\mathbb{E}\,S(X\mid\theta)$, because 

$$0 = \frac{\partial r(\theta^*)}{\partial\theta} = -\frac{\partial\,\mathbb{E}\ln g(X\mid\theta^*)}{\partial\theta} = -\mathbb{E}\,\frac{\partial \ln g(X\mid\theta^*)}{\partial\theta} = -\mathbb{E}\,S(X\mid\theta^*),$$

again provided that the order of differentiation and integration (expectation) can be 
swapped. In the same way, $H(\theta)$ is then the Hessian matrix of $r$. Let $g(\cdot\mid\hat\theta_n)$ be the 
minimizer of the training loss 

$$r_{\mathcal{T}_n}(\theta) := -\frac{1}{n}\sum_{i=1}^n \ln g(X_i\mid\theta),$$

where $\mathcal{T}_n = \{X_1,\ldots,X_n\}$ is a random training set. Let $r^*$ be the smallest possible 
cross-entropy risk, taken over all functions; clearly, $r^* = -\mathbb{E}\ln f(X)$, where $X \sim f$. Similar to 
the supervised learning case, we can decompose the generalization risk, $\ell(g(\cdot\mid\hat\theta_n)) = r(\hat\theta_n)$, 
into 

$$r(\hat\theta_n) = r^* + \underbrace{r(\theta^*) - r^*}_{\text{approx. error}} + \underbrace{r(\hat\theta_n) - r(\theta^*)}_{\text{statistical error}} = r(\theta^*) - \mathbb{E}\ln\frac{g(X\mid\hat\theta_n)}{g(X\mid\theta^*)}.$$


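The following theorem specifies the asymptotic behavior of the components of the 
generalization risk. In the proof we assume that $\hat\theta_n \xrightarrow{\mathbb{P}} \theta^*$ as $n \to \infty$. 

Theorem 4.1: Approximating the Cross-Entropy Risk 

It holds asymptotically ($n \to \infty$) that 

$$\mathbb{E}\,r(\hat\theta_n) - r(\theta^*) \simeq \frac{\mathrm{tr}\left(F(\theta^*)\,H^{-1}(\theta^*)\right)}{2n}, \tag{4.11}$$

where 

$$r(\theta^*) \simeq \mathbb{E}\,r_{\mathcal{T}_n}(\hat\theta_n) + \frac{\mathrm{tr}\left(F(\theta^*)\,H^{-1}(\theta^*)\right)}{2n}. \tag{4.12}$$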

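Proof: A Taylor expansion of $r(\hat\theta_n)$ around $\theta^*$ gives the statistical error 

$$r(\hat\theta_n) - r(\theta^*) = (\hat\theta_n - \theta^*)^\top \underbrace{\frac{\partial r(\theta^*)}{\partial\theta}}_{=\,0} + \frac{1}{2}(\hat\theta_n - \theta^*)^\top H(\bar\theta_n)(\hat\theta_n - \theta^*), \tag{4.13}$$

where $\bar\theta_n$ lies on the line segment between $\theta^*$ and $\hat\theta_n$. For large $n$ we may replace $H(\bar\theta_n)$ with 
$H(\theta^*)$ as, by assumption, $\hat\theta_n$ converges to $\theta^*$. The matrix $H(\theta^*)$ is positive definite because 
$r(\theta)$ is strictly convex at $\theta^*$ by assumption, and therefore invertible. It is important to realize 
that $\hat\theta_n$ is in fact an M-estimator of $\theta^*$. In particular, in the notation of Theorem C.19, we 
have $\psi = S$, $A = H(\theta^*)$, and $B = F(\theta^*)$. Consequently, by that same theorem, 

$$\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}\left(0,\; H^{-1}(\theta^*)\,F(\theta^*)\,H^{-\top}(\theta^*)\right). \tag{4.14}$$

Combining (4.13) with (4.14), it follows from Theorem C.2 that asymptotically the 
expected estimation error is given by (4.11). 
Next, we consider a Taylor expansion of $r_{\mathcal{T}_n}(\theta^*)$ around $\hat\theta_n$: 

$$r_{\mathcal{T}_n}(\theta^*) = r_{\mathcal{T}_n}(\hat\theta_n) + (\theta^* - \hat\theta_n)^\top \underbrace{\frac{\partial r_{\mathcal{T}_n}(\hat\theta_n)}{\partial\theta}}_{=\,0} + \frac{1}{2}(\theta^* - \hat\theta_n)^\top H_{\mathcal{T}_n}(\bar\theta_n)(\theta^* - \hat\theta_n), \tag{4.15}$$

where $H_{\mathcal{T}_n}(\bar\theta_n) := -\frac{1}{n}\sum_{i=1}^n \frac{\partial S(X_i\mid\bar\theta_n)}{\partial\theta}$ is the Hessian of $r_{\mathcal{T}_n}(\theta)$ at some $\bar\theta_n$ between $\hat\theta_n$ and $\theta^*$. 
Taking expectations on both sides of (4.15), we obtain 

$$r(\theta^*) = \mathbb{E}\,r_{\mathcal{T}_n}(\hat\theta_n) + \frac{1}{2}\,\mathbb{E}\,(\theta^* - \hat\theta_n)^\top H_{\mathcal{T}_n}(\bar\theta_n)(\theta^* - \hat\theta_n).$$

Replacing $H_{\mathcal{T}_n}(\bar\theta_n)$ with $H(\theta^*)$ for large $n$ and using (4.14), we have 

$$n\,\mathbb{E}\,(\theta^* - \hat\theta_n)^\top H_{\mathcal{T}_n}(\bar\theta_n)(\theta^* - \hat\theta_n) \longrightarrow \mathrm{tr}\left(F(\theta^*)\,H^{-1}(\theta^*)\right), \quad n \to \infty.$$

Therefore, asymptotically as $n \to \infty$, we have (4.12). □ 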


Theorem 4.1 has a number of interesting consequences: 

1. Similar to Section 2.5.1, the training loss $\ell_{\mathcal{T}_n}(g_{\mathcal{T}_n}) = r_{\mathcal{T}_n}(\hat\theta_n)$ tends to underestimate the 
risk $\ell(g^{\mathcal{G}}) = r(\theta^*)$, because the training set $\mathcal{T}_n$ is used both to train $g \in \mathcal{G}$ (that is, estimate 
$\theta^*$) and to estimate the risk. The relation (4.12) tells us that on average the training loss 
underestimates the true risk by $\mathrm{tr}(F(\theta^*)\,H^{-1}(\theta^*))/(2n)$. 

2. Adding equations (4.11) and (4.12) yields the following asymptotic approximation to 
the expected generalization risk: 

$$\mathbb{E}\,r(\hat\theta_n) \simeq \mathbb{E}\,r_{\mathcal{T}_n}(\hat\theta_n) + \frac{1}{n}\,\mathrm{tr}\left(F(\theta^*)\,H^{-1}(\theta^*)\right). \tag{4.16}$$

The first term on the right-hand side of (4.16) can be estimated (without bias) via the 
training loss $r_{\mathcal{T}_n}(\hat\theta_n)$. As for the second term, we have already mentioned that when the 
true model $f \in \mathcal{G}$, then $F(\theta^*) = H(\theta^*)$. Therefore, when $\mathcal{G}$ is deemed to be a sufficiently 
rich class of models parameterized by a $p$-dimensional vector $\theta$, we may approximate the 
second term as $\mathrm{tr}(F(\theta^*)\,H^{-1}(\theta^*))/n \approx \mathrm{tr}(I_p)/n = p/n$. This suggests the following heuristic 
approximation to the (expected) generalization risk: 

$$\mathbb{E}\,r(\hat\theta_n) \approx r_{\mathcal{T}_n}(\hat\theta_n) + \frac{p}{n}. \tag{4.17}$$

3. Multiplying both sides of (4.16) by $2n$ and substituting $\mathrm{tr}(F(\theta^*)\,H^{-1}(\theta^*)) \approx p$, we obtain 
the approximation: 

$$2n\,r(\hat\theta_n) \approx -2\sum_{i=1}^n \ln g(X_i\mid\hat\theta_n) + 2p. \tag{4.18}$$

The right-hand side of (4.18) is called the Akaike information criterion (AIC). Just like 
(4.17), the AIC approximation can be used to compare the difference in generalization risk 
of two or more learners. We prefer the learner with the smallest (estimated) generalization 
risk. 
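To make the use of (4.18) concrete, here is a minimal sketch that compares two candidate model classes by AIC. The data-generating distribution, the two candidate families (exponential with $p = 1$ and normal with $p = 2$), and the helper name `aic` are illustrative assumptions rather than part of the text. 

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)   # illustrative data: exponential with mean 2
n = x.size

# Exponential model (p = 1): the MLE of the rate is 1 / sample mean.
lam = 1 / x.mean()
loglik_exp = n * np.log(lam) - lam * x.sum()

# Normal model (p = 2): the MLEs are the sample mean and the (biased) sample variance.
mu, var = x.mean(), x.var()
loglik_norm = -0.5 * n * np.log(2 * np.pi * var) - ((x - mu) ** 2).sum() / (2 * var)

def aic(loglik, p):
    """Right-hand side of (4.18): -2 * maximized log-likelihood + 2 * p."""
    return -2 * loglik + 2 * p

aic_exp, aic_norm = aic(loglik_exp, 1), aic(loglik_norm, 2)
print(f"AIC exponential: {aic_exp:.1f}, AIC normal: {aic_norm:.1f}")
```

The model with the smaller AIC value is preferred; for data of this kind one would expect that to be the exponential model. 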


Suppose that, for a training set $\tau$, the training loss $r_\tau(\theta)$ has a unique minimum point 
$\hat\theta$ which lies in the interior of $\Theta$. If $r_\tau(\theta)$ is a differentiable function with respect to $\theta$, then 
we can find the optimal parameter $\hat\theta$ by solving 

$$\frac{\partial r_\tau(\theta)}{\partial\theta} = -\underbrace{\frac{1}{n}\sum_{i=1}^n S(x_i\mid\theta)}_{S_\tau(\theta)} = 0.$$

In other words, the maximum likelihood estimate $\hat\theta$ for $\theta$ is obtained by finding a root of 
the average score function, that is, by solving 

$$S_\tau(\theta) = 0. \tag{4.19}$$

It is often not possible to find $\hat\theta$ in an explicit form. In that case one needs to solve the 
equation (4.19) numerically. There exist many standard techniques for root-finding, e.g., 
Newton's method (see Section B.3.1), whereby, starting from an initial guess $\theta_0$, 
subsequent iterates are obtained via the iterative scheme 

$$\theta_{t+1} = \theta_t + H_\tau^{-1}(\theta_t)\,S_\tau(\theta_t),$$

where 

$$H_\tau(\theta) := \frac{-\partial S_\tau(\theta)}{\partial\theta} = -\frac{1}{n}\sum_{i=1}^n \frac{\partial S(x_i\mid\theta)}{\partial\theta}$$

is the average Hessian matrix of $\{-\ln g(x_i\mid\theta)\}_{i=1}^n$. Under $f = g(\cdot\mid\theta)$, the expectation of 
$H_{\mathcal{T}}(\theta)$ is equal to the information matrix $F(\theta)$, which does not depend on the data. This 
suggests an alternative iterative scheme, called Fisher's scoring method: 

$$\theta_{t+1} = \theta_t + F^{-1}(\theta_t)\,S_\tau(\theta_t), \tag{4.20}$$

which is not only easier to implement (if the information matrix can be readily evaluated), 
but also is more numerically stable. 


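Example 4.1 (Maximum Likelihood for the Gamma Distribution) We wish to 
approximate the density of the $\mathsf{Gamma}(\alpha^*, \lambda^*)$ distribution for some true but unknown 
parameters $\alpha^*$ and $\lambda^*$, on the basis of a training set $\tau = \{x_1,\ldots,x_n\}$ of iid samples from this 
distribution. Choosing our approximating function $g(\cdot\mid\alpha,\lambda)$ in the same class of gamma 
densities, 

$$g(x\mid\alpha,\lambda) = \frac{\lambda^\alpha x^{\alpha-1}\mathrm{e}^{-\lambda x}}{\Gamma(\alpha)}, \quad x \geq 0, \tag{4.21}$$

with $\alpha > 0$ and $\lambda > 0$, we seek to solve (4.19). Taking the logarithm in (4.21), the 
log-likelihood function is given by 

$$l(x\mid\alpha,\lambda) := \alpha\ln\lambda - \ln\Gamma(\alpha) + (\alpha - 1)\ln x - \lambda x.$$

It follows that 

$$S(\alpha,\lambda) = \begin{bmatrix} \frac{\partial}{\partial\alpha}\,l(x\mid\alpha,\lambda) \\ \frac{\partial}{\partial\lambda}\,l(x\mid\alpha,\lambda) \end{bmatrix} = \begin{bmatrix} \ln\lambda - \psi(\alpha) + \ln x \\ \frac{\alpha}{\lambda} - x \end{bmatrix},$$

where $\psi$ is the derivative of $\ln\Gamma$: the so-called digamma function. Hence, 

$$H(\alpha,\lambda) = -\mathbb{E}\begin{bmatrix} \frac{\partial^2}{\partial\alpha^2}\,l(X\mid\alpha,\lambda) & \frac{\partial^2}{\partial\alpha\,\partial\lambda}\,l(X\mid\alpha,\lambda) \\ \frac{\partial^2}{\partial\alpha\,\partial\lambda}\,l(X\mid\alpha,\lambda) & \frac{\partial^2}{\partial\lambda^2}\,l(X\mid\alpha,\lambda) \end{bmatrix} = \begin{bmatrix} \psi'(\alpha) & -\frac{1}{\lambda} \\ -\frac{1}{\lambda} & \frac{\alpha}{\lambda^2} \end{bmatrix}.$$

Fisher's scoring method (4.20) can now be used to solve (4.19), with 

$$S_\tau(\alpha,\lambda) = \begin{bmatrix} \ln\lambda - \psi(\alpha) + n^{-1}\sum_{i=1}^n \ln x_i \\ \frac{\alpha}{\lambda} - n^{-1}\sum_{i=1}^n x_i \end{bmatrix}$$

and $F(\alpha,\lambda) = H(\alpha,\lambda)$. 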
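The Fisher scoring iteration of Example 4.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `gamma_fisher_scoring`, the method-of-moments starting point, the stopping tolerance, and the simulated data are all choices made here and are not prescribed by the text. 

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_fisher_scoring(x, num_iter=100, tol=1e-10):
    """Solve S_tau(alpha, lam) = 0 for the Gamma model via Fisher scoring (4.20)."""
    x = np.asarray(x, dtype=float)
    mean_x, mean_logx = x.mean(), np.log(x).mean()
    # Starting point from the method of moments (an assumption of this sketch).
    theta = np.array([mean_x**2 / x.var(), mean_x / x.var()])
    for _ in range(num_iter):
        alpha, lam = theta
        # Average score S_tau(alpha, lam).
        score = np.array([np.log(lam) - digamma(alpha) + mean_logx,
                          alpha / lam - mean_x])
        # Information matrix F(alpha, lam) = H(alpha, lam); psi' is the trigamma function.
        fisher = np.array([[float(polygamma(1, alpha)), -1 / lam],
                           [-1 / lam, alpha / lam**2]])
        step = np.linalg.solve(fisher, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, scale=1 / 3.0, size=20_000)   # alpha* = 2, lambda* = 3
alpha_hat, lam_hat = gamma_fisher_scoring(data)
print(alpha_hat, lam_hat)
```

Note that NumPy parameterizes the gamma distribution by shape and scale, so the rate $\lambda$ corresponds to `1 / scale`. 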
4.3 Expectation–Maximization (EM) Algorithm 


The Expectation–Maximization algorithm (EM) is a general algorithm for maximization of 
complicated (log-)likelihood functions, through the introduction of auxiliary variables. 

To simplify the notation in this section, we use a Bayesian notation system, where 
the same symbol is used for different (conditional) probability densities. 

As in the previous section, given independent observations $\tau = \{x_1,\ldots,x_n\}$ from some 
unknown pdf $f$, the objective is to find the best approximation to $f$ in a function class 
$\mathcal{G} = \{g(\cdot\mid\theta), \theta \in \Theta\}$ by solving the maximum likelihood problem: 

$$\theta^* = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\; g(\tau\mid\theta), \tag{4.22}$$

where $g(\tau\mid\theta) := g(x_1\mid\theta)\cdots g(x_n\mid\theta)$. The key element of the EM algorithm is the 
augmentation of the data $\tau$ with a suitable vector of latent variables, $z$, such that 

$$g(\tau\mid\theta) = \int g(\tau, z\mid\theta)\,\mathrm{d}z.$$

The function $\theta \mapsto g(\tau, z\mid\theta)$ is usually referred to as the complete-data likelihood function. 
The choice of the latent variables is guided by the desire to make the maximization of 
$g(\tau, z\mid\theta)$ much easier than that of $g(\tau\mid\theta)$. 

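Suppose $p$ denotes an arbitrary density of the latent variables $z$. Then, we can write: 

$$\begin{aligned}
\ln g(\tau\mid\theta) &= \int p(z)\,\ln g(\tau\mid\theta)\,\mathrm{d}z \\
&= \int p(z)\,\ln\left(\frac{g(\tau,z\mid\theta)/p(z)}{g(z\mid\tau,\theta)/p(z)}\right)\mathrm{d}z \\
&= \int p(z)\,\ln\left(\frac{g(\tau,z\mid\theta)}{p(z)}\right)\mathrm{d}z - \int p(z)\,\ln\left(\frac{g(z\mid\tau,\theta)}{p(z)}\right)\mathrm{d}z \\
&= \int p(z)\,\ln\left(\frac{g(\tau,z\mid\theta)}{p(z)}\right)\mathrm{d}z + \mathcal{D}(p, g(\cdot\mid\tau,\theta)),
\end{aligned} \tag{4.23}$$

where $\mathcal{D}(p, g(\cdot\mid\tau,\theta))$ is the Kullback–Leibler divergence from the density $p$ to $g(\cdot\mid\tau,\theta)$. 
Since $\mathcal{D} \geq 0$, it follows that 

$$\ln g(\tau\mid\theta) \geq \int p(z)\,\ln\left(\frac{g(\tau,z\mid\theta)}{p(z)}\right)\mathrm{d}z =: \mathcal{L}(p,\theta)$$

for all $\theta$ and any density $p$ of the latent variables. In other words, $\mathcal{L}(p,\theta)$ is a lower bound 
on the log-likelihood that involves the complete-data likelihood. The EM algorithm then 
aims to increase this lower bound as much as possible by starting with an initial guess $\theta^{(0)}$ 
and then, for $t = 1,2,\ldots$, solving the following two steps: 

1. $p^{(t)} = \mathop{\mathrm{argmax}}_{p}\,\mathcal{L}(p, \theta^{(t-1)})$, 

2. $\theta^{(t)} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\,\mathcal{L}(p^{(t)}, \theta)$. 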
The first optimization problem can be solved explicitly. Namely, by (4.23), we have 
that 

$$p^{(t)} = \mathop{\mathrm{argmin}}_{p}\,\mathcal{D}(p, g(\cdot\mid\tau,\theta^{(t-1)})) = g(\cdot\mid\tau,\theta^{(t-1)}).$$

That is, the optimal density is the conditional density of the latent variables given the data 
$\tau$ and the parameter $\theta^{(t-1)}$. The second optimization problem can be simplified by writing 
$\mathcal{L}(p^{(t)},\theta) = Q^{(t)}(\theta) - \mathbb{E}_{p^{(t)}}\ln p^{(t)}(Z)$, where 

$$Q^{(t)}(\theta) := \mathbb{E}_{p^{(t)}}\ln g(\tau, Z\mid\theta)$$

is the expected complete-data log-likelihood under $Z \sim p^{(t)}$. Consequently, the 
maximization of $\mathcal{L}(p^{(t)},\theta)$ with respect to $\theta$ is equivalent to finding 

$$\theta^{(t)} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\,Q^{(t)}(\theta).$$

This leads to the following generic EM algorithm. 


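Algorithm 4.3.1: Generic EM Algorithm 
input: Data $\tau$, initial guess $\theta^{(0)}$. 
output: Approximation of the maximum likelihood estimate. 
1 $t \leftarrow 1$ 
2 while a stopping criterion is not met do 
3     Expectation Step: Find $p^{(t)}(z) := g(z\mid\tau,\theta^{(t-1)})$ and compute the expectation 
          $Q^{(t)}(\theta) := \mathbb{E}_{p^{(t)}}\ln g(\tau, Z\mid\theta)$.    (4.24) 
4     Maximization Step: Let $\theta^{(t)} \leftarrow \mathop{\mathrm{argmax}}_{\theta\in\Theta}\,Q^{(t)}(\theta)$. 
5     $t \leftarrow t + 1$ 
6 return $\theta^{(t)}$ 


A possible stopping criterion is to stop when 

$$\left|\frac{\ln g(\tau\mid\theta^{(t)}) - \ln g(\tau\mid\theta^{(t-1)})}{\ln g(\tau\mid\theta^{(t)})}\right| \leq \varepsilon$$

for some small tolerance $\varepsilon > 0$. 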


Remark 4.1 (Properties of the EM Algorithm) The identity (4.23) can be used to 
show that the likelihood $g(\tau\mid\theta^{(t)})$ does not decrease with every iteration of the algorithm. 
This property is one of the strengths of the algorithm. For example, it can be used to debug 
computer implementations of the EM algorithm: if the likelihood is observed to decrease 
at any iteration, then one has detected a bug in the program. 

The convergence of the sequence $\{\theta^{(t)}\}$ to a global maximum (if it exists) is highly 
dependent on the initial value $\theta^{(0)}$ and, in many cases, an appropriate choice of $\theta^{(0)}$ may not 
be clear. Typically, practitioners run the algorithm from different random starting points 
over $\Theta$, to ascertain empirically that a suitable optimum is achieved. 
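Example 4.2 (Censored Data) Suppose the lifetime (in years) of a certain type of 
machine is modeled via a $\mathcal{N}(\mu, \sigma^2)$ distribution. To estimate $\mu$ and $\sigma^2$, the lifetimes of 
$n$ (independent) machines are recorded up to $c$ years. Denote these censored lifetimes 
by $x_1,\ldots,x_n$. The $\{x_i\}$ are thus realizations of iid random variables $\{X_i\}$, distributed as 
$\min\{Y, c\}$, where $Y \sim \mathcal{N}(\mu, \sigma^2)$. 

By the law of total probability (see (C.9)), the marginal pdf of each $X$ can be written 
as: 

$$g(x\mid\mu,\sigma^2) = \underbrace{\Phi((c-\mu)/\sigma)}_{\mathbb{P}[Y<c]}\;\frac{\varphi_{\sigma^2}(x-\mu)}{\Phi((c-\mu)/\sigma)}\,\mathbb{1}\{x < c\} + \underbrace{\overline{\Phi}((c-\mu)/\sigma)}_{\mathbb{P}[Y \geq c]}\,\mathbb{1}\{x = c\},$$

where $\varphi_{\sigma^2}(\cdot)$ is the pdf of the $\mathcal{N}(0, \sigma^2)$ distribution, $\Phi$ is the cdf of the standard normal 
distribution, and $\overline{\Phi} := 1 - \Phi$. It follows that the likelihood of the data $\tau = \{x_1,\ldots,x_n\}$ as a 
function of the parameter $\theta := [\mu, \sigma^2]^\top$ is: 

$$g(\tau\mid\theta) = \prod_{i:\,x_i<c} \frac{\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}} \times \prod_{i:\,x_i=c} \overline{\Phi}((c-\mu)/\sigma).$$

Let $n_c$ be the total number of $x_i$ such that $x_i = c$. Using $n_c$ latent variables $z = [z_1,\ldots,z_{n_c}]^\top$, 
we can write the joint pdf: 

$$g(\tau, z\mid\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\sum_{i:\,x_i<c}(x_i-\mu)^2}{2\sigma^2} - \frac{\sum_{i=1}^{n_c}(z_i-\mu)^2}{2\sigma^2}\right)\mathbb{1}\Big\{\min_i z_i \geq c\Big\},$$

so that $\int g(\tau, z\mid\theta)\,\mathrm{d}z = g(\tau\mid\theta)$. We can thus apply the EM algorithm to maximize the 
likelihood, as follows. 

For the E(xpectation)-step, we have for a fixed $\theta$: 

$$g(z\mid\tau,\theta) = \prod_{i=1}^{n_c} g(z_i\mid\tau,\theta),$$

where $g(z\mid\tau,\theta) = \mathbb{1}\{z \geq c\}\,\varphi_{\sigma^2}(z-\mu)/\overline{\Phi}((c-\mu)/\sigma)$ is simply the pdf of the $\mathcal{N}(\mu, \sigma^2)$ 
distribution, truncated to $[c, \infty)$. 

For the M(aximization)-step, we compute the expectation of the complete log-likelihood 
with respect to a fixed $g(z\mid\tau,\theta)$ and use the fact that $Z_1,\ldots,Z_{n_c}$ are iid: 

$$\mathbb{E}\ln g(\tau, Z\mid\theta) = -\frac{\sum_{i:\,x_i<c}(x_i-\mu)^2}{2\sigma^2} - \frac{n_c\,\mathbb{E}(Z-\mu)^2}{2\sigma^2} - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln(2\pi),$$

where $Z$ has a $\mathcal{N}(\mu, \sigma^2)$ distribution, truncated to $[c, \infty)$. To maximize the last expression 
with respect to $\mu$ we set the derivative with respect to $\mu$ to zero, and obtain: 

$$\mu = \frac{n_c\,\mathbb{E}Z + \sum_{i:\,x_i<c} x_i}{n}.$$

Similarly, setting the derivative with respect to $\sigma^2$ to zero gives: 

$$\sigma^2 = \frac{n_c\,\mathbb{E}(Z-\mu)^2 + \sum_{i:\,x_i<c}(x_i-\mu)^2}{n}.$$

In summary, the EM iterates for $t = 1,2,\ldots$ are as follows. 

E-step. Given the current estimate $\theta_t := [\mu_t, \sigma_t^2]^\top$, compute the expectations $\nu_t := \mathbb{E}Z$ and 
$\zeta_t^2 := \mathbb{E}(Z - \mu_t)^2$, where $Z \sim \mathcal{N}(\mu_t, \sigma_t^2)$, conditional on $Z \geq c$; that is, 

$$\nu_t := \mu_t + \sigma_t^2\,\frac{\varphi_{\sigma_t^2}(c-\mu_t)}{\overline{\Phi}((c-\mu_t)/\sigma_t)}, \qquad \zeta_t^2 := \sigma_t^2\left(1 + (c-\mu_t)\,\frac{\varphi_{\sigma_t^2}(c-\mu_t)}{\overline{\Phi}((c-\mu_t)/\sigma_t)}\right).$$

M-step. Update the estimate to $\theta_{t+1} := [\mu_{t+1}, \sigma_{t+1}^2]^\top$ via the formulas: 

$$\mu_{t+1} = \frac{n_c\,\nu_t + \sum_{i:\,x_i<c} x_i}{n}, \qquad \sigma_{t+1}^2 = \frac{n_c\,\zeta_t^2 + \sum_{i:\,x_i<c}(x_i - \mu_{t+1})^2}{n}.$$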
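The E- and M-step formulas of the censored-data example translate almost line by line into code. The following is a minimal sketch: the function name `em_censored_normal`, the starting values (mean and variance of the censored data), the stopping rule, and the simulated data are assumptions of this illustration. 

```python
import numpy as np
from scipy.stats import norm

def em_censored_normal(x, c, num_iter=500, tol=1e-10):
    """EM iterates for N(mu, sig2) lifetimes that are right-censored at c."""
    x = np.asarray(x, dtype=float)
    n = x.size
    obs = x[x < c]                 # uncensored observations
    n_c = n - obs.size             # number of censored observations (x_i = c)
    mu, sig2 = x.mean(), x.var()   # starting values (an assumption of this sketch)
    for _ in range(num_iter):
        sig = np.sqrt(sig2)
        # E-step: moments of N(mu, sig2) conditional on Z >= c.
        # norm.pdf(c, loc=mu, scale=sig) is the N(0, sig2) pdf evaluated at c - mu.
        ratio = norm.pdf(c, loc=mu, scale=sig) / norm.sf((c - mu) / sig)
        nu = mu + sig2 * ratio                      # E Z
        zeta2 = sig2 * (1 + (c - mu) * ratio)       # E (Z - mu)^2
        # M-step: update mu first, then sig2 using the new mu.
        mu_new = (n_c * nu + obs.sum()) / n
        sig2_new = (n_c * zeta2 + ((obs - mu_new) ** 2).sum()) / n
        converged = abs(mu_new - mu) + abs(sig2_new - sig2) < tol
        mu, sig2 = mu_new, sig2_new
        if converged:
            break
    return mu, sig2

rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=20_000)     # true lifetimes, mu = 2 and sigma^2 = 1
c = 3.0
x_cens = np.minimum(y, c)                 # recorded (censored) lifetimes
mu_hat, sig2_hat = em_censored_normal(x_cens, c)
print(mu_hat, sig2_hat)
```

In this simulated setting roughly 16% of the lifetimes are censored, and the estimates should land close to the true values, whereas the naive mean and variance of the censored data are biased downward. 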


4.4 Empirical Distribution and Density Estimation 


In Section 1.5.2.3 we saw how the empirical cdf $F_n$, obtained from an iid training set 
$\tau = \{x_1,\ldots,x_n\}$ from an unknown distribution on $\mathbb{R}$, gives an estimate of the unknown cdf 
$F$ of this sampling distribution. The function $F_n$ is a genuine cdf, as it is right-continuous, 
increasing, and lies between 0 and 1. The corresponding discrete probability distribution 
is called the empirical distribution of the data. A random variable $X$ distributed according 
to this empirical distribution takes the values $x_1,\ldots,x_n$ with equal probability $1/n$. The 
concept of empirical distribution naturally generalizes to higher dimensions: a random 
vector $X$ that is distributed according to the empirical distribution of $x_1,\ldots,x_n$ has discrete 
pdf $\mathbb{P}[X = x_i] = 1/n$, $i = 1,\ldots,n$. Sampling from such a distribution (in other words, 
resampling the original data) was discussed in Section 3.2.4. The preeminent usage of 
such sampling is the bootstrap method, discussed in Section 3.3.2. 

In a way, the empirical distribution is the natural answer to the unsupervised learning 
question: what is the underlying probability distribution of the data? However, the empirical 
distribution is, by definition, a discrete distribution, whereas the true sampling distribution 
might be continuous. For continuous data it makes sense to also consider estimation 
of the pdf of the data. A common approach is to estimate the density via a kernel density 
estimate (KDE); the most prevalent learner to carry this out is given next. 


Definition 4.1: Gaussian KDE 

Let $x_1,\ldots,x_n \in \mathbb{R}^d$ be the outcomes of an iid sample from a continuous pdf $f$. A 
Gaussian kernel density estimate of $f$ is a mixture of normal pdfs, of the form 

$$g_{\tau_n}(x\mid\sigma) = \frac{1}{n}\sum_{i=1}^n \frac{1}{(2\pi)^{d/2}\sigma^d}\,\mathrm{e}^{-\frac{\|x-x_i\|^2}{2\sigma^2}}, \quad x \in \mathbb{R}^d, \tag{4.25}$$

where $\sigma > 0$ is called the bandwidth. 

We see that $g_{\tau_n}$ in (4.25) is the average of a collection of $n$ normal pdfs, where each 
normal distribution is centered at the data point $x_i$ and has covariance matrix $\sigma^2 I_d$. A major 
question is how to choose the bandwidth $\sigma$ so as to best approximate the unknown pdf $f$. 
Choosing a very small $\sigma$ will result in a "spiky" estimate, whereas a large $\sigma$ will produce 
an over-smoothed estimate that may not identify important peaks that are present in the 
unknown pdf. Figure 4.1 illustrates this phenomenon. In this case the data comprise 
20 points uniformly drawn from the unit square. The true pdf is thus 1 on $[0, 1]^2$ and 0 
elsewhere. 


Figure 4.1: Two two-dimensional Gaussian KDEs, with $\sigma = 0.01$ (left) and $\sigma = 0.1$ (right). 


Let us write the Gaussian KDE in (4.25) as 

$$g_{\tau_n}(x\mid\sigma) = \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma^d}\,\phi\left(\frac{x-x_i}{\sigma}\right), \tag{4.26}$$

where 

$$\phi(z) = \frac{1}{(2\pi)^{d/2}}\,\mathrm{e}^{-\frac{\|z\|^2}{2}}, \quad z \in \mathbb{R}^d, \tag{4.27}$$

is the pdf of the $d$-dimensional standard normal distribution. By choosing a different 
probability density $\phi$ in (4.26), satisfying $\phi(x) = \phi(-x)$ for all $x$, we can obtain a wide variety 
of kernel density estimates. A simple pdf $\phi$ is, for example, the uniform pdf on $[-1, 1]^d$: 

$$\phi(z) = \begin{cases} 2^{-d}, & \text{if } z \in [-1,1]^d, \\ 0, & \text{otherwise.} \end{cases}$$

Figure 4.2 shows the graph of the corresponding KDE, using the same data as in Figure 4.1 
and with bandwidth $\sigma = 0.1$. We observe qualitatively similar behavior for the Gaussian 
and uniform KDEs. As a rule, the choice of the function $\phi$ is less important than the choice 
of the bandwidth in determining the quality of the estimate. 
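The kernel form of the KDE, with either the Gaussian or the uniform kernel, can be sketched as follows. The function names (`gaussian_kernel`, `uniform_kernel`, `kernel_density`) and the convention that points are stored as rows of a two-dimensional array are assumptions of this illustration, not the book's code. 

```python
import numpy as np

def gaussian_kernel(z):
    """Pdf of the d-dimensional standard normal distribution; z has shape (..., d)."""
    d = z.shape[-1]
    return np.exp(-0.5 * np.sum(z**2, axis=-1)) / (2 * np.pi) ** (d / 2)

def uniform_kernel(z):
    """Uniform pdf on [-1, 1]^d; z has shape (..., d)."""
    d = z.shape[-1]
    return np.all(np.abs(z) <= 1, axis=-1) / 2**d

def kernel_density(x, data, sigma, kernel=gaussian_kernel):
    """Evaluate the KDE at the rows of x (shape (m, d)) for data of shape (n, d)."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    d = data.shape[1]
    z = (x[:, None, :] - data[None, :, :]) / sigma    # shape (m, n, d)
    return kernel(z).mean(axis=1) / sigma**d

rng = np.random.default_rng(4)
pts = rng.uniform(size=(20, 2))                       # 20 points in the unit square
query = np.array([[0.5, 0.5], [2.0, 2.0]])
print(kernel_density(query, pts, sigma=0.1))
print(kernel_density(query, pts, sigma=0.1, kernel=uniform_kernel))
```

Because all pairwise differences are formed at once, memory use grows with the product of the number of query points and data points; for large problems one would evaluate in chunks. 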

Figure 4.2: A two-dimensional uniform KDE, with bandwidth $\sigma = 0.1$. 

The important issue of bandwidth selection has been extensively studied for 
one-dimensional data. To explain the ideas, we use our usual setup and let $\tau = \{x_1,\ldots,x_n\}$ 
be the observed (one-dimensional) data from the unknown pdf $f$. First, we define the loss 
function as 

$$\mathrm{Loss}(f(x), g(x)) = \frac{(f(x) - g(x))^2}{f(x)}. \tag{4.28}$$

The risk to minimize is thus $\ell(g) := \mathbb{E}_f\,\mathrm{Loss}(f(X), g(X)) = \int (f(x) - g(x))^2\,\mathrm{d}x$. We bypass 
the selection of a class of approximation functions by choosing the learner to be specified 
by (4.25) for a fixed $\sigma$. The objective is now to find a $\sigma$ that minimizes the generalization 
risk $\ell(g_\tau(\cdot\mid\sigma))$ or the expected generalization risk $\mathbb{E}\,\ell(g_{\mathcal{T}}(\cdot\mid\sigma))$. The generalization risk is 
in this case 

$$\int (f(x) - g_\tau(x\mid\sigma))^2\,\mathrm{d}x = \int f^2(x)\,\mathrm{d}x - 2\int f(x)\,g_\tau(x\mid\sigma)\,\mathrm{d}x + \int g_\tau^2(x\mid\sigma)\,\mathrm{d}x.$$

Minimizing this expression with respect to $\sigma$ is equivalent to minimizing the last two terms, 
which can be written as 

$$-2\,\mathbb{E}_f\,g_\tau(X\mid\sigma) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \int \frac{1}{\sigma^2}\,\phi\left(\frac{x-x_i}{\sigma}\right)\phi\left(\frac{x-x_j}{\sigma}\right)\mathrm{d}x.$$

This expression in turn can be estimated by using a test sample $\{x_1',\ldots,x_{n'}'\}$ from $f$, yielding 
the following minimization problem: 

$$\min_\sigma\; -\frac{2}{n'}\sum_{i=1}^{n'} g_\tau(x_i'\mid\sigma) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \int \frac{1}{\sigma^2}\,\phi\left(\frac{x-x_i}{\sigma}\right)\phi\left(\frac{x-x_j}{\sigma}\right)\mathrm{d}x,$$

where $\int \frac{1}{\sigma^2}\,\phi\left(\frac{x-x_i}{\sigma}\right)\phi\left(\frac{x-x_j}{\sigma}\right)\mathrm{d}x = \frac{1}{\sqrt{2}\,\sigma}\,\phi\left(\frac{x_i-x_j}{\sqrt{2}\,\sigma}\right)$ in the case of the Gaussian kernel (4.27) with 
$d = 1$. To estimate $\sigma$ in this way clearly requires a test sample, or at least an application of 
cross-validation. Another approach is to minimize the expected generalization risk (that 
is, averaged over all training sets): 

$$\mathbb{E}\int (f(x) - g_{\mathcal{T}}(x\mid\sigma))^2\,\mathrm{d}x.$$

This is called the mean integrated squared error (MISE). It can be decomposed into an 
integrated squared bias and integrated variance component: 

$$\int (f(x) - \mathbb{E}\,g_{\mathcal{T}}(x\mid\sigma))^2\,\mathrm{d}x + \int \mathrm{Var}(g_{\mathcal{T}}(x\mid\sigma))\,\mathrm{d}x.$$
A typical analysis now proceeds by investigating how the MISE behaves for large $n$, under 
various assumptions on $f$. For example, it is shown in [114] that, for $\sigma \to 0$ and $n\sigma \to \infty$, 
the asymptotic approximation to the MISE of the Gaussian kernel density estimator (4.25) 
(for $d = 1$) is given by 

$$\frac{1}{4}\,\sigma^4\,\|f''\|^2 + \frac{1}{2n\sqrt{\pi\sigma^2}}, \tag{4.29}$$

where $\|f''\|^2 := \int (f''(x))^2\,\mathrm{d}x$. The asymptotically optimal value of $\sigma$ is the minimizer 

$$\sigma^* := \left(\frac{1}{2n\sqrt{\pi}\,\|f''\|^2}\right)^{1/5}. \tag{4.30}$$

To compute the optimal $\sigma^*$ in (4.30), one needs to estimate the functional $\|f''\|^2$. The 
Gaussian rule of thumb is to assume that $f$ is the density of the $\mathcal{N}(\bar{x}, s^2)$ distribution, where 
$\bar{x}$ and $s^2$ are the sample mean and variance of the data, respectively [113]. In this case 
$\|f''\|^2 = s^{-5}\pi^{-1/2}\,3/8$ and the Gaussian rule of thumb becomes: 

$$\sigma_{\mathrm{rot}} = \left(\frac{4\,s^5}{3\,n}\right)^{1/5} \approx 1.06\,s\,n^{-1/5}.$$
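The Gaussian rule of thumb is a one-line computation. In the sketch below, the function name `gaussian_rule_of_thumb` and the simulated data are illustrative assumptions; the sample standard deviation is computed with the usual $n - 1$ divisor. 

```python
import numpy as np

def gaussian_rule_of_thumb(x):
    """Bandwidth (4 s^5 / (3 n))^(1/5) for one-dimensional data x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)             # sample standard deviation
    return (4 * s**5 / (3 * n)) ** 0.2

rng = np.random.default_rng(6)
sample = rng.normal(loc=1.0, scale=2.0, size=1000)
sigma_rot = gaussian_rule_of_thumb(sample)
print(sigma_rot)
```

The constant $(4/3)^{1/5}$ is approximately 1.06, which is where the familiar form $1.06\,s\,n^{-1/5}$ comes from. 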


We recommend, however, the fast and reliable theta KDE of [14], which chooses the 
bandwidth in an optimal way via a fixed-point procedure. Figures 4.1 and 4.2 illustrate a 
common problem with traditional KDEs: for distributions on a bounded domain, such as 
the uniform distribution on [0, 1], the KDE assigns positive probability mass outside this 
domain. An additional advantage of the theta KDE is that it largely avoids this boundary 
effect. We illustrate the theta KDE with the following example. 


Example 4.3 (Comparison of Gaussian and theta KDEs) The following Python program 
draws an iid sample from the $\mathsf{Exp}(1)$ distribution and constructs a Gaussian kernel 
density estimate. We see in Figure 4.3 that with an appropriate choice of the bandwidth 
a good fit to the true pdf can be achieved, except at the boundary $x = 0$. The theta KDE 
does not exhibit this boundary effect. Moreover, it chooses the bandwidth automatically, 
to achieve a superior fit. The theta KDE source code is available as kde.py on the book's 
GitHub site. 

Figure 4.3: Kernel density estimates for Exp(1)-distributed data. (Curves shown: Gaussian KDE, theta KDE, and true density.) 
gausthetakde.py 


import matplotlib.pyplot as plt 
import numpy as np 
from kde import *  # theta KDE; kde.py is available on the book's GitHub site 

sig = 0.1; sig2 = sig**2; c = 1/np.sqrt(2*np.pi)/sig  # Constants 
phi = lambda x,x0: np.exp(-(x-x0)**2/(2*sig2))  # Unscaled Kernel 
f = lambda x: np.exp(-x)*(x >= 0)  # True PDF 
n = 10**4  # Sample Size 
x = -np.log(np.random.uniform(size=n))  # Generate Data via IT method 
xx = np.arange(-0.5,6,0.01, dtype = "d")  # Plot Range 
phis = np.zeros(len(xx)) 
for i in range(0,n): 
    phis = phis + phi(xx,x[i]) 
phis = c*phis/n 
plt.plot(xx,phis,'r')  # Plot Gaussian KDE 
[bandwidth,density,xmesh,cdf] = kde(x,2**12,0,max(x)) 
idx = (xmesh <= 6) 
plt.plot(xmesh[idx],density[idx])  # Plot Theta KDE 
plt.plot(xx,f(xx))  # Plot True PDF 





4.5 Clustering via Mixture Models 


Clustering is concerned with the grouping of unlabeled feature vectors into clusters, such 
that samples within a cluster are more similar to each other than samples belonging to 
different clusters. Usually, it is assumed that the number of clusters is known in advance, 
but otherwise no prior information is given about the data. Applications of clustering can 
be found in the areas of communication, data compression and storage, database searching, 
pattern matching, and object recognition. 

A common approach to clustering analysis is to assume that the data comes from a mix- 
ture of (usually Gaussian) distributions, and thus the objective is to estimate the parameters 
of the mixture model by maximizing the likelihood function for the data. Direct optimiza- 
tion of the likelihood function in this case is not a simple task, due to necessary constraints 
on the parameters (more about this later) and the complicated nature of the likelihood func- 
tion, which in general has a great number of local maxima and saddle-points. A popular 
method to estimate the parameters of the mixture model is the EM algorithm, which was 
discussed in a more general setting in Section 4.3. In this section we explain the basics of 
mixture modeling and explain the workings of the EM method in this context. In addition, 
we show how direct optimization methods can be used to maximize the likelihood. 


4.5.1 Mixture Models 


Let $\mathcal{T} := \{X_1,\ldots,X_n\}$ be iid random vectors taking values in some set $\mathcal{X} \subseteq \mathbb{R}^d$, each $X_i$ 
being distributed according to the mixture density 

$$g(x\mid\theta) = w_1\phi_1(x) + \cdots + w_K\phi_K(x), \quad x \in \mathcal{X}, \tag{4.31}$$

where $\phi_1,\ldots,\phi_K$ are probability densities (discrete or continuous) on $\mathcal{X}$, and the positive 
weights $w_1,\ldots,w_K$ sum up to 1. This mixture pdf can be interpreted in the following way. 
Let $Z$ be a discrete random variable taking values $1, 2,\ldots, K$ with probabilities $w_1,\ldots,w_K$, 
and let $X$ be a random vector whose conditional pdf, given $Z = z$, is $\phi_z$. By the product rule 
(C.17), the joint pdf of $Z$ and $X$ is given by 

$$\phi_{Z,X}(z, x) = \phi_Z(z)\,\phi_{X\mid Z}(x\mid z) = w_z\,\phi_z(x)$$

and the marginal pdf of $X$ is found by summing the joint pdf over the values of $z$, which 
gives (4.31). A random vector $X \sim g$ can thus be simulated in two steps: 

1. First, draw $Z$ according to the probabilities $\mathbb{P}[Z = z] = w_z$, $z = 1,\ldots,K$. 
2. Then draw $X$ according to the pdf $\phi_Z$. 

As $\mathcal{T}$ only contains the $\{X_i\}$ variables, the $\{Z_i\}$ are viewed as latent variables. We can 
interpret $Z_i$ as the hidden label of the cluster to which $X_i$ belongs. 
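The two-step simulation recipe for a mixture can be sketched for Gaussian components as follows. The function name `sample_gaussian_mixture` and the weights, means, and covariance matrices used in the demonstration are hypothetical values chosen for illustration; labels are coded 0 to K-1 rather than 1 to K. 

```python
import numpy as np

def sample_gaussian_mixture(n, weights, means, covs, rng):
    """Draw n points from a Gaussian mixture; returns the points and their hidden labels."""
    weights = np.asarray(weights, dtype=float)
    K = weights.size
    # Step 1: draw the hidden cluster labels Z_i with P[Z = k] = w_k.
    z = rng.choice(K, size=n, p=weights)
    # Step 2: given Z_i = k, draw X_i from the k-th Gaussian component.
    d = len(means[0])
    x = np.empty((n, d))
    for k in range(K):
        idx = (z == k)
        x[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))
    return x, z

rng = np.random.default_rng(7)
weights = [0.5, 0.3, 0.2]                               # hypothetical weights
means = [[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]]           # hypothetical mean vectors
covs = [np.eye(2), 0.5 * np.eye(2), [[1.0, 0.3], [0.3, 0.5]]]
X, Z = sample_gaussian_mixture(3000, weights, means, covs, rng)
print(X.shape, np.bincount(Z, minlength=3) / Z.size)
```

In an actual clustering problem only `X` would be observed; the labels `Z` are returned here purely so that the simulation can be inspected. 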

Typically, each $\phi_k$ in (4.31) is assumed to be known up to some parameter vector $\eta_k$. It 
is customary¹ in clustering analysis to work with Gaussian mixtures; that is, each density 
$\phi_k$ is Gaussian with some unknown expectation vector $\mu_k$ and covariance matrix $\Sigma_k$. We 
gather all unknown parameters, including the weights $\{w_k\}$, into a parameter vector $\theta$. As 
usual, $\tau = \{x_1,\ldots,x_n\}$ denotes the outcome of $\mathcal{T}$. As the components of $\mathcal{T}$ are iid, their 
(joint) pdf is given by 

$$g(\tau\mid\theta) := \prod_{i=1}^n g(x_i\mid\theta) = \prod_{i=1}^n \sum_{k=1}^K w_k\,\phi_k(x_i\mid\mu_k,\Sigma_k). \tag{4.32}$$

Following the same reasoning as for (4.5), we can estimate $\theta$ from an outcome $\tau$ by 
maximizing the log-likelihood function 

$$l(\theta\mid\tau) := \sum_{i=1}^n \ln g(x_i\mid\theta) = \sum_{i=1}^n \ln\left(\sum_{k=1}^K w_k\,\phi_k(x_i\mid\mu_k,\Sigma_k)\right). \tag{4.33}$$

However, finding the maximizer of $l(\theta\mid\tau)$ is not easy in general, since the function is 
typically multiextremal. 
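Evaluating the mixture log-likelihood (4.33) is straightforward, although the inner sum is best computed on the log scale to avoid underflow. The sketch below assumes Gaussian components and uses SciPy's `logsumexp` and `multivariate_normal.logpdf`; the function name `mixture_loglik` and the test parameters are illustrative assumptions. 

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_loglik(x, weights, means, covs):
    """Log-likelihood (4.33) of a Gaussian mixture for data x of shape (n, d)."""
    x = np.atleast_2d(x)
    # Column k holds ln w_k + ln phi_k(x_i | mu_k, Sigma_k) for every data point i.
    log_terms = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    ])
    # Log of the inner sum over k, then the outer sum over the data points.
    return logsumexp(log_terms, axis=1).sum()

rng = np.random.default_rng(8)
pts = rng.normal(size=(100, 2))
ll = mixture_loglik(pts, [0.4, 0.6],
                    [[0.0, 0.0], [1.0, 1.0]],
                    [np.eye(2), 2 * np.eye(2)])
print(ll)
```

A function of this kind can be handed to a general-purpose optimizer or used to monitor the progress of an iterative fitting procedure. 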


E Example 4.4 (Clustering via Mixture Models) The data depicted in Figure 4.4 con- 
sists of 300 data points that were independently generated from three bivariate normal 
distributions, whose parameters are given in that same figure. For each of these three dis- 
tributions, exactly 100 points were generated. Ideally, we would like to cluster the data into 
three clusters that correspond to the three cases. 

To cluster the data into three groups, a possible model for the data is to assume that 
the points are 1id draws from an (unknown) mixture of three 2-dimensional Gaussian dis- 
tributions. This is a sensible approach, although in reality the data were not simulated 
in this way. It is instructive to understand the difference between the two models. In the 
mixture model, each cluster label Z takes the value {1,2,3} with equal probability, and 
hence, drawing the labels independently, the total number of points in each cluster would 





¹Other common mixture distributions include Student $t$ and Beta distributions.

[Figure 4.4, left panel: scatter plot of the 300 data points. Right panel: parameter table.]

cluster   mean vector      covariance matrix
1         [-4, 0]^T        [[2, 1.4], [1.4, 1.5]]
2         [0.5, -1]^T      [[2, -0.95], [-0.95, 1]]
3         [-1.5, -3]^T     [[2, 0.1], [0.1, 0.1]]

Figure 4.4: Cluster the 300 data points (left) into three clusters, without making any
assumptions about the probability distribution of the data. In fact, the data were generated
from three bivariate normal distributions, whose parameters are listed on the right.











be Bin(300, 1/3) distributed. However, in the actual simulation, the number of points in 
each cluster is exactly 100. Nevertheless, the mixture model would be an accurate (al- 
though not exact) model for these data. Figure 4.5 displays the “target” Gaussian mixture 
density for the data in Figure 4.4; that is, the mixture with equal weights and with the exact 
parameters as specified in Figure 4.4. 
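To make the difference between the two sampling schemes concrete, here is a small sketch. The seed and the variable names are assumptions for illustration. It draws 300 iid labels with equal probabilities, so that the cluster sizes are random, and contrasts them with the fixed sizes used in the actual simulation.

```python
import numpy as np

rng = np.random.default_rng(1)   # hypothetical seed
n, K = 300, 3

# Mixture model: iid labels, each value with probability 1/3
labels = rng.integers(0, K, size=n)
counts = np.bincount(labels, minlength=K)   # random cluster sizes, summing to n

# Actual simulation: exactly 100 points per cluster
fixed_counts = np.full(K, n // K)

# Standard deviation of a Bin(300, 1/3) cluster size
sd = np.sqrt(n * (1 / 3) * (2 / 3))
print(counts, fixed_counts, round(sd, 3))
```

Under the mixture model each cluster size fluctuates around 100 with a standard deviation of about 8, which is why the mixture model is an accurate but not exact description of these data.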


Figure 4.5: The target mixture density for the data in Figure 4.4.


In the next section we will carry out the clustering by using the EM algorithm. ■


4.5.2 EM Algorithm for Mixture Models 


As we saw in Section 4.3, instead of maximizing the log-likelihood function (4.33) directly
from the data $\tau = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the EM algorithm first augments the data with the vector of
latent variables, in this case the hidden cluster labels $\mathbf{z} = \{z_1, \ldots, z_n\}$. The idea is that $\tau$ is
only the observed part of the complete random data $(\mathcal{T}, \mathbf{Z})$, which were generated via the
two-step procedure described above. That is, for each data point $\mathbf{X}$, first draw the cluster
label $Z \in \{1, \ldots, K\}$ according to probabilities $\{w_1, \ldots, w_K\}$ and then, given $Z = z$, draw $\mathbf{X}$
from $\phi_z$. The joint pdf of $\mathcal{T}$ and $\mathbf{Z}$ is

$$g(\tau, \mathbf{z} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} w_{z_i} \, \phi_{z_i}(\mathbf{x}_i),$$

which is of a much simpler form than (4.32). It follows that the complete-data
log-likelihood function

$$l(\boldsymbol{\theta} \mid \tau, \mathbf{z}) = \sum_{i=1}^{n} \ln \left[ w_{z_i} \, \phi_{z_i}(\mathbf{x}_i) \right] \tag{4.34}$$

is often easier to maximize than the original log-likelihood (4.33), for any given $(\tau, \mathbf{z})$. But,
of course, the latent variables $\mathbf{z}$ are not observed and so $l(\boldsymbol{\theta} \mid \tau, \mathbf{z})$ cannot be evaluated. In the
E-step of the EM algorithm, the complete-data log-likelihood is replaced with the
expectation $\mathbb{E}_p \, l(\boldsymbol{\theta} \mid \tau, \mathbf{Z})$, where the subscript $p$ in the expectation indicates that $\mathbf{Z}$ is distributed
according to the conditional pdf of $\mathbf{Z}$ given $\mathcal{T} = \tau$; that is, with pdf

$$p(\mathbf{z}) = g(\mathbf{z} \mid \tau, \boldsymbol{\theta}) \propto g(\tau, \mathbf{z} \mid \boldsymbol{\theta}). \tag{4.35}$$

Note that $p(\mathbf{z})$ is of the form $p_1(z_1) \cdots p_n(z_n)$ so that, given $\mathcal{T} = \tau$, the components of $\mathbf{Z}$ are
independent of each other. The EM algorithm for mixture models can now be formulated
as follows.


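Algorithm 4.5.1: EM Algorithm for Mixture Models
input: Data $\tau$, initial guess $\boldsymbol{\theta}^{(0)}$.
output: Approximation of the maximum likelihood estimate.
1 $t \leftarrow 1$
2 while a stopping criterion is not met do
3     Expectation Step: Find $p^{(t)}(\mathbf{z}) := g(\mathbf{z} \mid \tau, \boldsymbol{\theta}^{(t-1)})$ and $Q^{(t)}(\boldsymbol{\theta}) := \mathbb{E}_{p^{(t)}} \, l(\boldsymbol{\theta} \mid \tau, \mathbf{Z})$.
4     Maximization Step: Let $\boldsymbol{\theta}^{(t)} \leftarrow \operatorname{argmax}_{\boldsymbol{\theta}} Q^{(t)}(\boldsymbol{\theta})$.
5     $t \leftarrow t + 1$
6 return $\boldsymbol{\theta}^{(t-1)}$ (the most recently computed estimate)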


A possible termination condition is to stop when
$|l(\boldsymbol{\theta}^{(t)} \mid \tau) - l(\boldsymbol{\theta}^{(t-1)} \mid \tau)| \, / \, |l(\boldsymbol{\theta}^{(t)} \mid \tau)| < \varepsilon$
for some small tolerance $\varepsilon > 0$. As was mentioned in Section 4.3, the sequence of
log-likelihood values does not decrease with each iteration. Under certain continuity
conditions, the sequence $\{\boldsymbol{\theta}^{(t)}\}$ is guaranteed to converge to a local maximizer of the
log-likelihood $l$. Convergence to a global maximizer (if it exists) depends on the appropriate
choice for the starting value. Typically, the algorithm is run from different random starting
points.

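For the case of Gaussian mixtures, each $\phi_k = \phi(\cdot \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, $k = 1, \ldots, K$, is the density
of a $d$-dimensional Gaussian distribution. Let $\boldsymbol{\theta}^{(t-1)}$ be the current guess for the optimal
parameter vector, consisting of the weights $\{w_k^{(t-1)}\}$, mean vectors $\{\boldsymbol{\mu}_k^{(t-1)}\}$, and covariance
matrices $\{\boldsymbol{\Sigma}_k^{(t-1)}\}$. We first determine $p^{(t)}$, the pdf of $\mathbf{Z}$ conditional on $\mathcal{T} = \tau$, for the
given guess $\boldsymbol{\theta}^{(t-1)}$. As mentioned before, the components of $\mathbf{Z}$ given $\mathcal{T} = \tau$ are independent,
so it suffices to specify the discrete pdf, $p_i^{(t)}$ say, of each $Z_i$ given the observed point $\mathbf{X}_i = \mathbf{x}_i$.
The latter can be found from Bayes' formula:

$$p_i^{(t)}(k) \propto w_k^{(t-1)} \, \phi_k(\mathbf{x}_i \mid \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)}), \quad k = 1, \ldots, K. \tag{4.36}$$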


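Next, in view of (4.34), the function $Q^{(t)}(\boldsymbol{\theta})$ can be written as

$$Q^{(t)}(\boldsymbol{\theta}) = \mathbb{E}_{p^{(t)}} \sum_{i=1}^{n} \left( \ln w_{Z_i} + \ln \phi_{Z_i}(\mathbf{x}_i \mid \boldsymbol{\mu}_{Z_i}, \boldsymbol{\Sigma}_{Z_i}) \right) = \sum_{i=1}^{n} \mathbb{E}_{p_i^{(t)}} \left[ \ln w_{Z_i} + \ln \phi_{Z_i}(\mathbf{x}_i \mid \boldsymbol{\mu}_{Z_i}, \boldsymbol{\Sigma}_{Z_i}) \right],$$

where the $\{Z_i\}$ are independent and $Z_i$ is distributed according to $p_i^{(t)}$ in (4.36). This
completes the E-step. In the M-step we maximize $Q^{(t)}$ with respect to the parameter $\boldsymbol{\theta}$; that is,
with respect to the $\{w_k\}$, $\{\boldsymbol{\mu}_k\}$, and $\{\boldsymbol{\Sigma}_k\}$. In particular, we maximize

$$\sum_{i=1}^{n} \sum_{k=1}^{K} p_i^{(t)}(k) \left[ \ln w_k + \ln \phi_k(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$

under the condition $\sum_k w_k = 1$. Using Lagrange multipliers and the fact that $\sum_{k=1}^{K} p_i^{(t)}(k) = 1$
gives the solution for the $\{w_k\}$:

$$w_k = \frac{1}{n} \sum_{i=1}^{n} p_i^{(t)}(k), \quad k = 1, \ldots, K. \tag{4.37}$$

The solutions for $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ now follow from maximizing $\sum_{i=1}^{n} p_i^{(t)}(k) \ln \phi_k(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$,
leading to

$$\boldsymbol{\mu}_k = \frac{\sum_{i=1}^{n} p_i^{(t)}(k) \, \mathbf{x}_i}{\sum_{i=1}^{n} p_i^{(t)}(k)}, \quad k = 1, \ldots, K \tag{4.38}$$

and

$$\boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^{n} p_i^{(t)}(k) \, (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_{i=1}^{n} p_i^{(t)}(k)}, \quad k = 1, \ldots, K, \tag{4.39}$$

which are very similar to the well-known formulas for the MLEs of the parameters of a
Gaussian distribution. After assigning the solution parameters to $\boldsymbol{\theta}^{(t)}$ and increasing the
iteration counter $t$ by 1, the steps (4.36), (4.37), (4.38), and (4.39) are repeated until
convergence is reached. Convergence of the EM algorithm is very sensitive to the choice of
initial parameters. It is therefore recommended to try various different starting conditions.
For a further discussion of the theoretical and practical aspects of the EM algorithm we
refer to [85].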
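The updates (4.36) to (4.39), together with the relative-change stopping rule, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the book's implementation: the function name `em_gmm`, the array layout (means stored as rows of a $K \times d$ array), the tolerance, and the synthetic two-cluster data are all choices made for this example. The log-likelihood (4.33) is recorded at every iteration so that its non-decreasing behaviour can be inspected.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, W, M, C, tol=1e-8, max_iter=500):
    """EM for a Gaussian mixture; W: (K,), M: (K, d), C: (K, d, d)."""
    W, M, C = W.copy(), M.copy(), C.copy()
    n, K = X.shape[0], len(W)
    history = []
    for _ in range(max_iter):
        # E-step (4.36): ln(w_k * phi_k(x_i)), then normalize over k
        logp = np.column_stack([
            np.log(W[k]) + multivariate_normal(M[k], C[k]).logpdf(X)
            for k in range(K)])
        norm = logsumexp(logp, axis=1, keepdims=True)
        history.append(norm.sum())          # value of (4.33)
        P = np.exp(logp - norm)             # P[i, k] = p_i(k)
        # relative-change stopping rule
        if len(history) > 1 and \
           abs(history[-1] - history[-2]) / abs(history[-1]) < tol:
            break
        # M-step (4.37)-(4.39)
        Nk = P.sum(axis=0)
        W = Nk / n
        M = (P.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - M[k]
            C[k] = (Xc * P[:, [k]]).T @ Xc / Nk[k]
    return W, M, C, np.array(history)

# Hypothetical synthetic data: two well-separated clusters in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)),
               rng.normal(3, 1, size=(100, 2))])
W0 = np.array([0.5, 0.5])
M0 = np.array([[-1.0, -1.0], [1.0, 1.0]])
C0 = np.array([np.eye(2), np.eye(2)])
W, M, C, hist = em_gmm(X, W0, M0, C0)
print(W, M, len(hist))
```

Working with log-densities in the E-step is a design choice of this sketch; it guards against underflow when a point lies far from every component.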


■ Example 4.5 (Clustering via EM) We return to the data in Example 4.4, depicted in
Figure 4.4, and adopt the model that the data is coming from a mixture of three bivariate
Gaussian distributions.

The Python code below implements the EM procedure described in Algorithm 4.5.1.
The initial mean vectors $\{\boldsymbol{\mu}_k\}$ of the bivariate Gaussian distributions are chosen (from visual
inspection) to lie roughly in the middle of each cluster, in this case $[-2, -3]^\top$, $[-4, 1]^\top$, and
$[0, -1]^\top$. The corresponding covariance matrices are initially chosen as identity matrices,
which is appropriate given the observed spread of the data in Figure 4.4. Finally, the initial
weights are 1/3, 1/3, 1/3. For simplicity, the algorithm stops after 100 iterations, which in
this case is more than enough to guarantee convergence. The code and data are available
from the book’s website in the GitHub folder Chapter4.







EMclust.py

import numpy as np
from scipy.stats import multivariate_normal

Xmat = np.genfromtxt('clusterdata.csv', delimiter=',')
K = 3
n, D = Xmat.shape

W = np.array([[1/3, 1/3, 1/3]])                  # initial weights
# Initial mean vectors, stored as the columns of M. A float dtype is
# required: if M were of integer type, the updates below would be
# truncated, which would give the wrong answer.
M = np.array([[-2.0, -4, 0], [-3, 1, -1]], dtype=float)
C = np.zeros((K, D, D))                           # covariance matrices
C[:, 0, 0] = 1                                    # start from identity matrices
C[:, 1, 1] = 1

p = np.zeros((K, n))

for i in range(0, 100):
    # E-step: p[k, j] proportional to w_k * phi_k(x_j), see (4.36)
    for k in range(0, K):
        mvn = multivariate_normal(M[:, k].T, C[k, :, :])
        p[k, :] = W[0, k] * mvn.pdf(Xmat)

    # M-step: updates (4.37), (4.38), and (4.39)
    p = p / np.sum(p, 0)                          # normalize
    W = np.mean(p, 1).reshape(1, K)
    for k in range(0, K):
        M[:, k] = (Xmat.T @ p[k, :].T) / np.sum(p[k, :])
        xm = Xmat.T - M[:, k].reshape(D, 1)
        C[k, :, :] = xm @ (xm * p[k, :]).T / np.sum(p[k, :])





The estimated parameters of the mixture distribution are given on the right-hand side 
of Figure 4.6. After relabeling of the clusters, we can observe a close match with the 
parameters in Figure 4.4. 

The ellipses on the left-hand side of Figure 4.6 show a close match between the 95%
probability ellipses² of the original Gaussian distributions (in gray) and the estimated ones.
A natural way to cluster each point $\mathbf{x}_i$ is to assign it to the cluster $k$ for which the conditional
probability $p_i(k)$ is maximal (with ties resolved arbitrarily). This gives the clustering of the
points into red, green, and blue clusters in the figure.

²For each mixture component, the contour of the corresponding bivariate normal pdf is shown that
encloses 95% of the probability mass.

Figure 4.6: The results of the EM clustering algorithm applied to the data depicted in
Figure 4.4.
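The assignment rule used in Example 4.5, where each point goes to the cluster with the largest conditional probability, amounts to an `argmax` over the cluster index. The small sketch below uses a hypothetical $3 \times 4$ array of conditional probabilities, laid out with one row per cluster as in EMclust.py; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical conditional probabilities p[k, i] for K = 3 clusters and
# 4 points; each column sums to 1, as after the normalization step.
p = np.array([[0.7, 0.1, 0.2, 0.3],
              [0.2, 0.8, 0.3, 0.3],
              [0.1, 0.1, 0.5, 0.4]])

labels = np.argmax(p, axis=0)   # ties are resolved in favor of the smallest index
print(labels)                   # [0 1 2 2]
```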


As an alternative to the EM algorithm, one can of course use continuous multiextremal
optimization algorithms to directly optimize the log-likelihood function $l(\boldsymbol{\theta} \mid \tau) = \ln g(\tau \mid \boldsymbol{\theta})$
in (4.33) over the set $\Theta$ of all possible $\boldsymbol{\theta}$. This is done for example in [15], demonstrating
superior results to EM when there are few data points. Closer investigation of the likelihood
function reveals that there is a hidden problem with any maximum likelihood approach for
clustering if $\Theta$ is chosen as large as possible, i.e., any mixture distribution is possible. To
demonstrate this problem, consider Figure 4.7, depicting the probability density function
$g(\cdot \mid \boldsymbol{\theta})$ of a mixture of two Gaussian distributions, where $\boldsymbol{\theta} = [w, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2]^\top$ is the
vector of parameters for the mixture distribution. The log-likelihood function is given by
$l(\boldsymbol{\theta} \mid \tau) = \sum_{i=1}^{4} \ln g(x_i \mid \boldsymbol{\theta})$, where $x_1, \ldots, x_4$ are the data (indicated by dots in the figure).

Figure 4.7: Mixture of two Gaussian distributions.


It is clear that by fixing the mixing constant $w$ at 0.25 (say) and centering the first
cluster at $x_1$, one can obtain an arbitrarily large likelihood value by taking the variance of
the first cluster to be arbitrarily small. Similarly, for higher dimensional data, by choosing
“point” or “line” clusters, or in general “degenerate” clusters, one can make the value of
the likelihood infinite. This is a manifestation of the familiar overfitting problem for the
training loss that we already encountered in Chapter 2. Thus, the unconstrained
maximization of the log-likelihood function is an ill-posed problem, irrespective of the choice of the
optimization algorithm!

Two possible solutions to this “overfitting” problem are: 


1. Restrict the parameter set © in such a way that degenerate clusters (sometimes called 
spurious clusters) are not allowed. 

2. Run the given algorithm and if the solution is degenerate, discard it and run the 
algorithm afresh. Keep restarting the algorithm until a non-degenerate solution is 
obtained. 


The first approach is usually applied to multiextremal optimization algorithms and the 
second is used for the EM algorithm. 
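The unboundedness of the likelihood for degenerate clusters can be checked numerically. In the sketch below the four data points and the parameters of the second component are hypothetical (they are not the values behind Figure 4.7). The mixing constant is fixed at 0.25 and the first component is centered at the first data point while its standard deviation shrinks.

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.5, 4.0])   # hypothetical data, not those of Figure 4.7

def loglik(sigma1, w=0.25, mu1=x[0], mu2=2.0, sigma2=1.5):
    """Log-likelihood of a two-component Gaussian mixture at the data x."""
    g = w * norm.pdf(x, mu1, sigma1) + (1 - w) * norm.pdf(x, mu2, sigma2)
    return np.log(g).sum()

# Shrink the standard deviation of the first component towards zero
sigmas = (1.0, 1e-2, 1e-4, 1e-8)
values = [loglik(s) for s in sigmas]
for s, v in zip(sigmas, values):
    print(s, v)
```

For these settings the log-likelihood keeps increasing as the first standard deviation decreases, growing roughly like $-\ln \sigma_1$, which is the degenerate behaviour that the two remedies above are meant to exclude.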


4.6 Clustering via Vector Quantization 


In the previous section we introduced clustering via mixture models, as a form of paramet- 
ric density estimation (as opposed to the nonparametric density estimation in Section 4.4). 
The clusters were modeled in a natural way via the latent variables and the EM algorithm 
provided a convenient way to assign the cluster members. In this section we consider a 
more heuristic approach to clustering by ignoring the distributional properties of the data. 
The resulting algorithms tend to scale better with the number of samples n and the dimen- 
sionality d. 

In mathematical terms, we consider the following clustering (also called data
segmentation) problem. Given a collection $\tau = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of data points in some $d$-dimensional
space $\mathcal{X}$, divide this data set into $K$ clusters (groups) such that some loss function is
minimized. A convenient way to determine these clusters is to first divide up the entire space
$\mathcal{X}$, using some distance function $\mathrm{dist}(\cdot, \cdot)$ on this space. A standard choice is the Euclidean
(or $L_2$) distance:

$$\mathrm{dist}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\| = \sqrt{\sum_{i=1}^{d} (x_i - x_i')^2}.$$

Other commonly used distance measures on $\mathbb{R}^d$ include the Manhattan distance:

$$\sum_{i=1}^{d} |x_i - x_i'|$$

and the maximum distance:

$$\max_{i=1,\ldots,d} |x_i - x_i'|.$$

On the set of strings of length $d$, an often-used distance measure is the Hamming distance:

$$\sum_{i=1}^{d} \mathbb{1}\{x_i \neq x_i'\},$$

that is, the number of mismatched characters. For example, the Hamming distance between
010101 and 011010 is 4.

We can partition the space $\mathcal{X}$ into regions as follows: First, we choose $K$ points
$\mathbf{c}_1, \ldots, \mathbf{c}_K$ called cluster centers or source vectors. For each $k = 1, \ldots, K$, let

$$\mathcal{R}_k = \{\mathbf{x} \in \mathcal{X} : \mathrm{dist}(\mathbf{x}, \mathbf{c}_k) \leq \mathrm{dist}(\mathbf{x}, \mathbf{c}_i) \text{ for all } i \neq k\}$$

be the set of points in $\mathcal{X}$ that lie closer to $\mathbf{c}_k$ than any other center. The regions or cells
$\{\mathcal{R}_k\}$ divide the space $\mathcal{X}$ into what is called a Voronoi diagram or a Voronoi tessellation.
Figure 4.8 shows a Voronoi tessellation of the plane into ten regions, using the Euclidean
distance. Note that here the boundaries between the Voronoi cells are straight line
segments. In particular, if cells $\mathcal{R}_i$ and $\mathcal{R}_j$ share a border, then a point on this border must satisfy
$\|\mathbf{x} - \mathbf{c}_i\| = \|\mathbf{x} - \mathbf{c}_j\|$; that is, it must lie on the line that passes through the point $(\mathbf{c}_j + \mathbf{c}_i)/2$
(that is, the midway point of the line segment between $\mathbf{c}_i$ and $\mathbf{c}_j$) and is perpendicular to
$\mathbf{c}_j - \mathbf{c}_i$.
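The distance measures and the nearest-center rule that defines the Voronoi cells can be sketched in a few lines. The function names and the two example centers below are assumptions made for illustration; only NumPy and built-in Python functions are used.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def maximum(x, y):
    return np.max(np.abs(x - y))

def hamming(s, t):
    """Number of mismatched characters of two strings of equal length."""
    return sum(a != b for a, b in zip(s, t))

def nearest_center(x, centers):
    """Index k of the Voronoi cell R_k containing x (Euclidean distance)."""
    return int(np.argmin([euclidean(x, c) for c in centers]))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(x, y), manhattan(x, y), maximum(x, y))   # 5.0 7.0 4.0
print(hamming("010101", "011010"))                       # 4

centers = np.array([[0.0, 0.0], [4.0, 0.0]])    # hypothetical centers
print(nearest_center(np.array([1.0, 1.0]), centers))     # 0
```

With these two centers, the cell boundary is the vertical line through the midpoint $(2, 0)$, perpendicular to the segment joining the centers, in agreement with the description above.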

















Figure 4.8: A Voronoi tessellation of the plane into ten cells, determined by the (red)
centers.


Once the centers (and thus the cells $\{\mathcal{R}_k\}$) are chosen, the points in $\tau$ can be clustered
according to their nearest center. Points on the boundary have to be treated separately. This
is a moot point for continuous data, as generally no data points will lie exactly on the
boundary.

The main remaining issue is how to choose the centers so as to cluster the data in some
optimal way. In terms of our (unsupervised) learning framework, we wish to approximate
a vector $\mathbf{x}$ via one of $\mathbf{c}_1, \ldots, \mathbf{c}_K$, using a piecewise constant vector-valued function

$$\mathbf{g}(\mathbf{x} \mid \mathbf{C}) := \sum_{k=1}^{K} \mathbf{c}_k \, \mathbb{1}\{\mathbf{x} \in \mathcal{R}_k\},$$

where $\mathbf{C}$ is the $d \times K$ matrix $[\mathbf{c}_1, \ldots, \mathbf{c}_K]$. Thus, $\mathbf{g}(\mathbf{x} \mid \mathbf{C}) = \mathbf{c}_k$ when $\mathbf{x}$ falls in region $\mathcal{R}_k$ (we
ignore ties). Within this class $\mathcal{G}$ of functions, parameterized by $\mathbf{C}$, our aim is to minimize
the training loss. In particular, for the squared-error loss, $\mathrm{Loss}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2$, the training
loss is

$$\ell_{\tau_n}(\mathbf{g}(\cdot \mid \mathbf{C})) = \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{g}(\mathbf{x}_i \mid \mathbf{C})\|^2 = \frac{1}{n} \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{R}_k \cap \tau_n} \|\mathbf{x} - \mathbf{c}_k\|^2. \tag{4.40}$$

Thus, minimizing the training loss amounts to minimizing the average squared distance
between each data point and its nearest center. This
framework also combines both the encoding and decoding steps in vector quantization
[125]. Namely, we wish to “quantize” or “encode” the vectors in $\tau$ in such a way that each
vector is represented by one of $K$ source vectors $\mathbf{c}_1, \ldots, \mathbf{c}_K$, such that the loss (4.40) of this
representation is minimized.

Most well-known clustering and vector quantization methods update the vector of cen- 
ters, starting from some initial choice and using iterative (typically gradient-based) proced- 
ures. It is important to realize that in this case (4.40) is seen as a function of the centers, 
where each point x is assigned to the nearest center, thus determining the clusters. It is well 
known that this type of problem — optimization with respect to the centers — is highly 
multiextremal and, depending on the initial clusters, gradient-based procedures tend to 
converge to a local minimum rather than a global minimum. 


4.6.1 K-Means 


One of the simplest methods for clustering is the K-means method. It is an iterative method 
where, starting from an initial guess for the centers, new centers are formed by taking 
sample means of the current points in each cluster. The new centers are thus the centroids 
of the points in each cell. Although there exist many different varieties of the K-means 
algorithm, they are all essentially of the following form: 


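Algorithm 4.6.1: K-Means
input: Collection of points $\tau = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, number of clusters $K$, initial centers
$\mathbf{c}_1, \ldots, \mathbf{c}_K$.
output: Cluster centers and cells (regions).
1 while a stopping criterion is not met do
2     $\mathcal{R}_1, \ldots, \mathcal{R}_K \leftarrow \emptyset$ (empty sets).
3     for $i = 1$ to $n$ do
4         $\mathbf{d} \leftarrow [\mathrm{dist}(\mathbf{x}_i, \mathbf{c}_1), \ldots, \mathrm{dist}(\mathbf{x}_i, \mathbf{c}_K)]$ // distances to centers
5         $k \leftarrow \operatorname{argmin}_j d_j$
6         $\mathcal{R}_k \leftarrow \mathcal{R}_k \cup \{\mathbf{x}_i\}$ // assign $\mathbf{x}_i$ to cluster $k$
7     for $k = 1$ to $K$ do
8         $\mathbf{c}_k \leftarrow \frac{\sum_{\mathbf{x} \in \mathcal{R}_k} \mathbf{x}}{|\mathcal{R}_k|}$ // compute the new center as a centroid of points
9 return $\{\mathbf{c}_k\}$, $\{\mathcal{R}_k\}$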


Thus, at each iteration, for a given choice of centers, each point in $\tau$ is assigned to
its nearest center. After all points have been assigned, the centers are recomputed as the
centroids of all the points in the current cluster (Line 8). A typical stopping criterion is to
stop when the centers no longer change very much. As the algorithm is quite sensitive to
the choice of the initial centers, it is prudent to try multiple starting values, e.g., chosen
randomly from the bounding box of the data points.

We can see the K-means method as a deterministic (or “hard”) version of the
probabilistic (or “soft”) EM algorithm as follows. Suppose in the EM algorithm we have
Gaussian mixtures with a fixed covariance matrix $\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}_d$, $k = 1, \ldots, K$, where $\sigma^2$ should be
thought of as being infinitesimally small. Consider iteration $t$ of the EM algorithm. Having
obtained the expectation vectors $\boldsymbol{\mu}_k^{(t-1)}$ and weights $w_k^{(t-1)}$, $k = 1, \ldots, K$, each point $\mathbf{x}_i$ is
assigned a cluster label $Z_i$ according to the probabilities $p_i^{(t)}(k)$, $k = 1, \ldots, K$ given in (4.36).
But for $\sigma^2 \to 0$ the probability distribution $\{p_i^{(t)}(k)\}$ becomes degenerate, putting all its
probability mass on $\operatorname{argmin}_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$. This corresponds to the K-means rule of assigning
$\mathbf{x}_i$ to its nearest cluster center. Moreover, in the M-step (4.38) each cluster center $\boldsymbol{\mu}_k^{(t)}$ is now
updated according to the average of the $\{\mathbf{x}_i\}$ that have been assigned to cluster $k$. We thus
obtain the same deterministic updating rule as in K-means.
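The passage from “soft” to “hard” assignments can be observed numerically. In the sketch below the point, the centers, and the weights are hypothetical. With $\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}_d$, the probabilities in (4.36) are proportional to $w_k \exp(-\|\mathbf{x} - \boldsymbol{\mu}_k\|^2 / (2\sigma^2))$, because the normalizing constant of the Gaussian density is the same for every $k$ and cancels.

```python
import numpy as np
from scipy.special import softmax

x = np.array([0.0, 0.0])                               # hypothetical point
mu = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])    # hypothetical centers
w = np.array([0.2, 0.3, 0.5])                          # hypothetical weights
d2 = np.sum((x - mu) ** 2, axis=1)                     # squared distances: 1, 4, 18

for sigma2 in (10.0, 1.0, 0.01):
    # softmax of ln w_k - ||x - mu_k||^2 / (2 sigma2) gives p(k) in (4.36)
    p = softmax(np.log(w) - d2 / (2 * sigma2))
    print(sigma2, np.round(p, 3))
```

For the smallest value of $\sigma^2$ essentially all probability mass sits on the nearest center, even though that center has the smallest weight.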


■ Example 4.6 (K-means Clustering) We cluster the data from Figure 4.4 via K-means,
using the Python implementation below. Note that the data points are stored as a 300 × 2
matrix Xmat. We take the same starting centers as in the EM example: $\mathbf{c}_1 = [-2, -3]^\top$, $\mathbf{c}_2 =
[-4, 1]^\top$, and $\mathbf{c}_3 = [0, -1]^\top$. Note also that squared Euclidean distances are used in the
computations, as these are slightly faster to compute than Euclidean distances (as no square
root computations are required) while yielding exactly the same cluster center evaluations.


Kmeans.py

import numpy as np

Xmat = np.genfromtxt('clusterdata.csv', delimiter=',')
K = 3
n, D = Xmat.shape
c = np.array([[-2.0, -4, 0], [-3, 1, -1]])   # initialize centers (as columns)
cold = np.zeros(c.shape)
dist2 = np.zeros((K, n))
while np.abs(c - cold).sum() > 0.001:        # stop when centers barely move
    cold = c.copy()
    for i in range(0, K):                     # compute the squared distances
        dist2[i, :] = np.sum((Xmat - c[:, i].T)**2, 1)

    label = np.argmin(dist2, 0)               # assign the points to nearest centroid
    minvals = np.amin(dist2, 0)
    for i in range(0, K):                     # recompute the centroids
        c[:, i] = np.mean(Xmat[label == i, :], 0)

print('Loss = {:3.3f}'.format(minvals.mean()))

Loss = 2.288

Figure 4.9: Results of the K-means algorithm applied to the data in Figure 4.4. The thick
black circles are the centroids and the dotted lines define the cell boundaries.


We found the cluster centers $\mathbf{c}_1 = [-1.9286, -3.0416]^\top$, $\mathbf{c}_2 = [-3.9237, 0.0131]^\top$, and
$\mathbf{c}_3 = [0.5611, -1.2980]^\top$, giving the clustering depicted in Figure 4.9. The corresponding
loss (4.40) was found to be 2.288. ■


4.6.2 Clustering via Continuous Multiextremal Optimization 


As already mentioned, the exact minimization of the loss function (4.40) is difficult to 
accomplish via standard local search methods such as gradient descent, as the function 
is highly multimodal. However, nothing is preventing us from using global optimization 
methods such as the CE or SCO methods discussed in Sections 3.4.2 and 3.4.3. 


■ Example 4.7 (Clustering via CE) We take the same data set as in Example 4.6 and
cluster the points via minimization of the loss (4.40) using the CE method. The Python
code below is very similar to the code in Example 3.16, except that now we are dealing
with a six-dimensional optimization problem. The loss function is implemented in the
function Scluster, which essentially reuses the squared distance computation of the K-means
code in Example 4.6. The CE program typically converges to a loss of 2.287,
corresponding to the (global) minimizers $\mathbf{c}_1 = [-1.9286, -3.0416]^\top$, $\mathbf{c}_2 = [-3.8681, 0.0456]^\top$, and
$\mathbf{c}_3 = [0.5880, -1.3526]^\top$, which slightly differs from the local minimizers for the K-means
algorithm.























import numpy as np
np.set_printoptions(precision=4)

Xmat = np.genfromtxt('clusterdata.csv', delimiter=',')
K = 3
n, D = Xmat.shape

def Scluster(c):
    # loss (4.40) for the centers stored in the vector c of length K*D
    n, D = Xmat.shape
    dist2 = np.zeros((K, n))
    cc = c.reshape(D, K)
    for i in range(0, K):
        dist2[i, :] = np.sum((Xmat - cc[:, i].T)**2, 1)
    minvals = np.amin(dist2, 0)
    return minvals.mean()

numvar = K*D
mu = np.zeros(numvar)            # initialize centers
sigma = np.ones(numvar)*2
rho = 0.1
N = 500; Nel = int(N*rho); eps = 0.001

func = Scluster
best_trj = None                  # best solution found so far
best_perf = np.inf
trj = np.zeros(shape=(N, numvar))

while np.max(sigma) > eps:
    for i in range(0, numvar):   # sample N candidate solutions
        trj[:, i] = (np.random.randn(N, 1)*sigma[i] + mu[i]).reshape(N,)
    S = np.zeros(N)
    for i in range(0, N):
        S[i] = func(trj[i])

    sortedids = np.argsort(S)    # from smallest to largest
    S_sorted = S[sortedids]
    eliteids = sortedids[range(0, Nel)]
    eliteTrj = trj[eliteids, :]
    mu = np.mean(eliteTrj, axis=0)
    sigma = np.std(eliteTrj, axis=0)

    # keep the best solution over all iterations (not reset inside the loop)
    if best_perf > S_sorted[0]:
        best_perf = S_sorted[0]
        best_trj = trj[sortedids[0]].copy()

print(best_perf)
print(best_trj.reshape(D, K))

2.2874901831572947
[[-3.9238 -1.8477  0.5895]
 [ 0.0134 -3.0292 -1.2442]]





4.7 Hierarchical Clustering 


It is sometimes useful to determine data clusters in a hierarchical manner; an example 
is the construction of evolutionary relationships between animal species. Establishing a 
hierarchy of clusters can be done in a bottom-up or a top-down manner. In the bottom-up 
approach, also called agglomerative clustering, the data points are merged in larger and 
larger clusters until all the points have been merged into a single cluster. In the top-down 
or divisive clustering approach, the data set is divided up into smaller and smaller clusters. 
The left panel of Figure 4.10 depicts a hierarchy of clusters. 


[Figure 4.10, right panel: dendrogram with vertical axis “Distance” and horizontal axis “Labels”, listing the labels in the order 7, 8, 6, 1, 2, 3, 4, 5.]

Figure 4.10: Left: a cluster hierarchy of 15 clusters. Right: the corresponding dendrogram.

In Figure 4.10, each cluster is given a cluster identifier. At the lowest level are clusters
comprised of the original data points (identifiers 1,...,8). The union of clusters 1 and 2
forms a cluster with identifier 9, and the union of 3 and 4 forms a cluster with identifier 10.
In turn the union of clusters 9 and 10 constitutes cluster 12, and so on.

The right panel of Figure 4.10 shows a convenient way to visualize cluster hierarchies 
using a dendrogram (from the Greek dendro for tree). A dendrogram not only summarizes 
how clusters are merged or split, but also shows the distance between clusters, here on the 
vertical axis. The horizontal axis shows which cluster each data point (label) belongs to. 

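Many different types of hierarchical clustering can be performed, depending on how
the distance is defined between two data points and between two clusters. Denote the data
set by $\mathcal{X} = \{\mathbf{x}_i, i = 1, \ldots, n\}$. As in Section 4.6, let $\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j)$ be the distance between data
points $\mathbf{x}_i$ and $\mathbf{x}_j$. The default choice is the Euclidean distance $\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|$.

Let $\mathcal{I}$ and $\mathcal{J}$ be two disjoint subsets of $\{1, \ldots, n\}$. These sets correspond to two disjoint
subsets (that is, clusters) of $\mathcal{X}$: $\{\mathbf{x}_i, i \in \mathcal{I}\}$ and $\{\mathbf{x}_j, j \in \mathcal{J}\}$. We denote the distance between
these two clusters by $d(\mathcal{I}, \mathcal{J})$. By specifying the function $d$, we indicate how the clusters
are linked. For this reason it is also referred to as the linkage criterion. We give a number
of examples:

• Single linkage. The closest distance between the clusters.

$$d_{\min}(\mathcal{I}, \mathcal{J}) := \min_{i \in \mathcal{I}, \, j \in \mathcal{J}} \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j).$$

• Complete linkage. The furthest distance between the clusters.

$$d_{\max}(\mathcal{I}, \mathcal{J}) := \max_{i \in \mathcal{I}, \, j \in \mathcal{J}} \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j).$$

• Group average. The mean distance between the clusters. Note that this depends on
the cluster sizes.

$$d_{\mathrm{avg}}(\mathcal{I}, \mathcal{J}) := \frac{1}{|\mathcal{I}| \, |\mathcal{J}|} \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}} \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j).$$

For these linkage criteria, $\mathcal{X}$ is usually assumed to be $\mathbb{R}^d$ with the Euclidean distance.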
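The three linkage criteria listed above can be computed directly from the matrix of pairwise distances between the two clusters. The following sketch uses `scipy.spatial.distance.cdist` with its default Euclidean metric; the function name `linkage_distances` and the two small clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distances(A, B):
    """Single, complete, and group-average linkage between point sets A and B."""
    D = cdist(A, B)      # D[i, j] = Euclidean distance between A[i] and B[j]
    return D.min(), D.max(), D.mean()

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # hypothetical cluster
B = np.array([[4.0, 0.0], [6.0, 0.0]])   # hypothetical cluster
d_min, d_max, d_avg = linkage_distances(A, B)
print(float(d_min), float(d_max), float(d_avg))   # 3.0 6.0 4.5
```

Here the four pairwise distances are 4, 6, 3, and 5, so the mean in the group-average linkage is taken over $|\mathcal{I}| \, |\mathcal{J}| = 4$ pairs.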

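Another notable measure for the distance between clusters is Ward’s minimum
variance linkage criterion. Here, the distance between clusters is expressed as the additional
amount of “variance” (expressed in terms of the sum of squares) that would be
introduced if the two clusters were merged. More precisely, for any set $\mathcal{K}$ of indices (labels) let
$\bar{\mathbf{x}}_{\mathcal{K}} = \sum_{k \in \mathcal{K}} \mathbf{x}_k / |\mathcal{K}|$ denote its corresponding cluster mean. Then

$$d_{\mathrm{Ward}}(\mathcal{I}, \mathcal{J}) := \sum_{k \in \mathcal{I} \cup \mathcal{J}} \|\mathbf{x}_k - \bar{\mathbf{x}}_{\mathcal{I} \cup \mathcal{J}}\|^2 - \left( \sum_{i \in \mathcal{I}} \|\mathbf{x}_i - \bar{\mathbf{x}}_{\mathcal{I}}\|^2 + \sum_{j \in \mathcal{J}} \|\mathbf{x}_j - \bar{\mathbf{x}}_{\mathcal{J}}\|^2 \right). \tag{4.41}$$

It can be shown (see Exercise 8) that the Ward linkage depends only on the cluster means
and the cluster sizes for $\mathcal{I}$ and $\mathcal{J}$:

$$d_{\mathrm{Ward}}(\mathcal{I}, \mathcal{J}) = \frac{|\mathcal{I}| \, |\mathcal{J}|}{|\mathcal{I}| + |\mathcal{J}|} \, \|\bar{\mathbf{x}}_{\mathcal{I}} - \bar{\mathbf{x}}_{\mathcal{J}}\|^2.$$

In software implementations, the Ward linkage function is often rescaled by
multiplying it by a factor of 2. In this way, the distance between one-point clusters $\{\mathbf{x}_i\}$
and $\{\mathbf{x}_j\}$ is the squared Euclidean distance $\|\mathbf{x}_i - \mathbf{x}_j\|^2$.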
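The equality between the definition (4.41) and the expression in terms of cluster means and sizes can be checked numerically before attempting the proof. The sketch below is only such a check, with hypothetical random clusters and function names chosen for illustration.

```python
import numpy as np

def ss(X):
    """Sum of squared distances of the rows of X to their mean."""
    return np.sum((X - X.mean(axis=0)) ** 2)

def ward_definition(A, B):
    # increase in the sum of squares caused by merging A and B, as in (4.41)
    return ss(np.vstack([A, B])) - (ss(A) + ss(B))

def ward_means(A, B):
    # expression in terms of the cluster sizes and cluster means only
    nA, nB = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return nA * nB / (nA + nB) * np.sum(diff ** 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))             # hypothetical clusters
B = rng.normal(loc=3.0, size=(8, 2))
print(np.isclose(ward_definition(A, B), ward_means(A, B)))   # True
```

For two one-point clusters the factor $|\mathcal{I}| \, |\mathcal{J}| / (|\mathcal{I}| + |\mathcal{J}|)$ equals 1/2, which explains the rescaling by 2 mentioned above.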





Having chosen a distance on X and a linkage criterion, a general agglomerative clus- 
tering algorithm proceeds in the following “greedy” manner. 


Algorithm 4.7.1: Greedy Agglomerative Clustering
input: Distance function dist, linkage function $d$, number of clusters $K$.
output: The label sets for the tree.
1 Initialize the set of cluster identifiers: $\mathcal{I} = \{1, \ldots, n\}$.
2 Initialize the corresponding label sets: $\mathcal{L}_i = \{i\}$, $i \in \mathcal{I}$.
3 Initialize a distance matrix $\mathbf{D} = [d_{ij}]$ with $d_{ij} = d(\{i\}, \{j\})$.
4 for $k = n + 1$ to $2n - K$ do
5     Find $i$ and $j > i$ in $\mathcal{I}$ such that $d_{ij}$ is minimal.
6     Create a new label set $\mathcal{L}_k := \mathcal{L}_i \cup \mathcal{L}_j$.
7     Add the new identifier $k$ to $\mathcal{I}$ and remove the old identifiers $i$ and $j$ from $\mathcal{I}$.
8     Update the distance matrix $\mathbf{D}$ with respect to the identifiers $i$, $j$, and $k$.
9 return $\mathcal{L}_i$, $i = 1, \ldots, 2n - K$


Initially, the distance matrix $\mathbf{D}$ contains the (linkage) distances between the one-point
clusters containing one of the data points $\mathbf{x}_1, \ldots, \mathbf{x}_n$, and hence with identifiers $1, \ldots, n$.
Finding the shortest distance amounts to a table lookup in $\mathbf{D}$. When the closest clusters
are found, they are merged into a new cluster, and a new identifier $k$ (the smallest positive
integer that has not yet been used as an identifier) is assigned to this cluster. The old
identifiers $i$ and $j$ are removed from the cluster identifier set $\mathcal{I}$. The matrix $\mathbf{D}$ is then updated
by adding a $k$-th column and row that contain the distances between $k$ and any $m \in \mathcal{I}$. This
updating step could be computationally quite costly if the cluster sizes are large and the
linkage distance between the clusters depends on all points within the clusters. Fortunately,
for many linkage functions, the matrix $\mathbf{D}$ can be updated in an efficient manner.


Suppose that at some stage in the algorithm, clusters $\mathcal{I}$ and $\mathcal{J}$, with identifiers $i$ and $j$,
respectively, are merged into a cluster $\mathcal{K} = \mathcal{I} \cup \mathcal{J}$ with identifier $k$. Let $\mathcal{M}$, with identifier
$m$, be a previously assigned cluster. An update rule of the linkage distance $d_{km}$ between $\mathcal{K}$
and $\mathcal{M}$ is called a Lance–Williams update if it can be written in the form

$$d_{km} = \alpha \, d_{im} + \beta \, d_{jm} + \gamma \, d_{ij} + \delta \, |d_{im} - d_{jm}|,$$

where $\alpha, \ldots, \delta$ depend only on simple characteristics of the clusters involved, such as the
number of elements within the clusters. Table 4.2 shows the update constants for a number
of common linkage functions. For example, for single linkage, $d_{im}$ is the minimal distance
between $\mathcal{I}$ and $\mathcal{M}$, and $d_{jm}$ is the minimal distance between $\mathcal{J}$ and $\mathcal{M}$. The smallest of
these is the minimal distance between $\mathcal{K}$ and $\mathcal{M}$. That is, $d_{km} = \min\{d_{im}, d_{jm}\} = d_{im}/2 +
d_{jm}/2 - |d_{im} - d_{jm}|/2$.
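As a small check of the Lance–Williams form, the sketch below evaluates the update for the single and complete linkage constants of Table 4.2. The function name and the three input distances are hypothetical. The results should coincide with the minimum and the maximum of $d_{im}$ and $d_{jm}$, respectively.

```python
def lance_williams(d_im, d_jm, d_ij, alpha, beta, gamma, delta):
    """Lance-Williams update of the linkage distance d_km."""
    return alpha * d_im + beta * d_jm + gamma * d_ij + delta * abs(d_im - d_jm)

d_im, d_jm, d_ij = 2.0, 5.0, 1.0     # hypothetical linkage distances

single = lance_williams(d_im, d_jm, d_ij, 0.5, 0.5, 0.0, -0.5)
complete = lance_williams(d_im, d_jm, d_ij, 0.5, 0.5, 0.0, 0.5)
print(single, complete)              # 2.0 5.0
```

The identity behind this is $\min\{a, b\} = (a + b - |a - b|)/2$ and $\max\{a, b\} = (a + b + |a - b|)/2$.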


Table 4.2: Constants for the Lance–Williams update rule for various linkage functions,
with $n_i$, $n_j$, $n_m$ denoting the number of elements in the corresponding clusters.

Linkage      α                          β                          γ                        δ
Single       1/2                        1/2                        0                        -1/2
Complete     1/2                        1/2                        0                        1/2
Group avg.   n_i/(n_i + n_j)            n_j/(n_i + n_j)            0                        0
Ward         (n_i + n_m)/(n_i+n_j+n_m)  (n_j + n_m)/(n_i+n_j+n_m)  -n_m/(n_i+n_j+n_m)       0


In practice, Algorithm 4.7.1 is run until a single cluster is obtained. Instead of returning
the label sets of all $2n - 1$ clusters, a linkage matrix is returned that contains the same
information. At the end of each iteration (Line 8) the linkage matrix stores the merged
labels $i$ and $j$, as well as the (minimal) distance $d_{ij}$. Other information such as the number
of elements in the merged cluster can also be stored. Dendrograms and cluster labels can be
directly constructed from the linkage matrix. In the following example, the linkage matrix
is returned by the method agg_cluster.


■ Example 4.8 (Agglomerative Hierarchical Clustering) The Python code below gives
a basic implementation of Algorithm 4.7.1 using the Ward linkage function. The methods
fcluster and dendrogram from the scipy module can be used to identify the labels in
a cluster and to draw the corresponding dendrogram.


AggCluster.py

import numpy as np
from scipy.spatial.distance import cdist
import scipy.cluster.hierarchy as h
import matplotlib.pyplot as plt

def update_distances(D, i, j, sizes):   # distances for merged cluster
    n = D.shape[0]
    d = np.inf * np.ones(n+1)
    for k in range(n):                   # Lance-Williams update (Ward)
        d[k] = ((sizes[i]+sizes[k])*D[i,k] +
                (sizes[j]+sizes[k])*D[j,k] -
                sizes[k]*D[i,j])/(sizes[i] + sizes[j] + sizes[k])

    infs = np.inf * np.ones(n)           # array of infinity
    D[i,:], D[:,i], D[j,:], D[:,j] = infs, infs, infs, infs  # deactivate
    new_D = np.inf * np.ones((n+1, n+1))
    new_D[0:n, 0:n] = D                  # copy old matrix into new_D
    new_D[-1,:], new_D[:,-1] = d, d      # add new row and column
    return new_D

def agg_cluster(X):
    n = X.shape[0]
    sizes = np.ones(n)
    D = cdist(X, X, metric='sqeuclidean')   # initialize dist. matrix
    np.fill_diagonal(D, np.inf)
    Z = np.zeros((n-1, 4))               # linkage matrix encodes hierarchy tree
    for t in range(n-1):
        i, j = np.unravel_index(D.argmin(), D.shape)   # minimizer pair
        sizes = np.append(sizes, sizes[i] + sizes[j])
        Z[t,:] = np.array([i, j, np.sqrt(D[i,j]), sizes[-1]])
        D = update_distances(D, i, j, sizes)            # update distance matr.
    return Z

X = np.genfromtxt('clusterdata.csv', delimiter=',')    # read the data
Z = agg_cluster(X)                       # form the linkage matrix

h.dendrogram(Z)                          # SciPy can produce a dendrogram from Z
# fcluster function assigns cluster ids to all points based on Z
cl = h.fcluster(Z, criterion='maxclust', t=3)

plt.figure(2)
plt.clf()
cols = ['red', 'green', 'blue']
colors = [cols[i-1] for i in cl]
plt.scatter(X[:,0], X[:,1], c=colors)
plt.show()





Note that the distance matrix is initialized with the squared Euclidean distance, so that
the Ward linkage is rescaled by a factor of 2. Also, note that the linkage matrix stores
the square root of the minimal cluster distances rather than the distances themselves. We
leave it as an exercise to check that by using these modifications the results agree with the
linkage method from scipy; see Exercise 9. ■


In contrast to the bottom-up (agglomerative) approach to hierarchical clustering, the 
divisive approach starts with one cluster, which is divided into two clusters that are as 
“dissimilar” as possible, which can then be further divided, and so on. We can use the same 
linkage criteria as for agglomerative clustering to divide a parent cluster into two child 
clusters by maximizing the distance between the child clusters. Although it is natural to try
to group together data by separating dissimilar ones as far as possible, the implementation
of this idea tends to scale poorly with n. The problem is related to the well-known max-cut
problem: given an n × n matrix of positive costs $c_{ij}$, $i, j \in \{1,\ldots,n\}$, partition the index set
$I = \{1,\ldots,n\}$ into two subsets $J$ and $K$ such that the total cost across the sets, that is,

$$\sum_{j \in J} \sum_{k \in K} c_{jk},$$



is maximal. If instead we maximize according to the average distance, we obtain the group 
average linkage criterion. 
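
To illustrate why the divisive approach scales poorly, the following minimal sketch enumerates every two-way partition of a tiny index set and keeps the one with maximal group average linkage. The function names and the toy data are assumptions made for illustration only; the number of partitions examined is 2^(n-1) - 1, so this is feasible only for very small n.

```python
import itertools
import numpy as np

def average_linkage(D, J, K):
    """Mean pairwise distance between the index sets J and K."""
    return D[np.ix_(J, K)].mean()

def best_split_bruteforce(D):
    """Try all 2^(n-1) - 1 two-way partitions of {0,...,n-1}."""
    n = D.shape[0]
    best_val, best_J = -np.inf, None
    # Index 0 is always placed in J so that each partition is counted once.
    for r in range(0, n - 1):
        for rest in itertools.combinations(range(1, n), r):
            J = [0, *rest]
            K = [k for k in range(n) if k not in J]
            val = average_linkage(D, J, K)
            if val > best_val:
                best_val, best_J = val, J
    return best_J, best_val

# Hypothetical toy data: two tight groups on the real line.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D = np.abs(x[:, None] - x[None, :])   # pairwise distance matrix
J, val = best_split_bruteforce(D)
print(J, val)
```

For these five points the search separates the first three points from the last two. Doubling n squares the number of candidate partitions (roughly), which is why a randomized search is used for larger data sets.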


Example 4.9 (Divisive Clustering via CE) The following Python code is used to divide
a small data set (of size 300) into two parts according to maximal group average linkage.
It uses a short cross-entropy algorithm similar to the one presented in Example 3.19.
Given a vector of probabilities $\{p_i, i = 1,\ldots,n\}$, the algorithm generates an N × n matrix
of Bernoulli random variables with success probability $p_i$ for column i, where N is the
sample size. For each row, the 0s and 1s divide the index set into two clusters, and the
corresponding average linkage distance is computed. The matrix is then sorted row-wise
according to these distances. Finally, the probabilities $\{p_i\}$ are updated according to the
mean values of the best 10% of the rows. The process is repeated until the $\{p_i\}$ degenerate
to a binary vector. This then presents the (approximate) solution.





import numpy as np
from numpy import genfromtxt
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

def S(x,D):
    V1 = np.where(x==0)[0]  # {V1,V2} is the partition
    V2 = np.where(x==1)[0]
    tmp = D[V1]
    tmp = tmp[:,V2]
    return np.mean(tmp)  # the size of the cut

def maxcut(D,N,eps,rho,alpha):
    n = D.shape[1]
    Ne = int(rho*N)  # number of elite samples
    p = 1/2*np.ones(n)
    p[0] = 1.0  # fix the first point in cluster 1
    while (np.max(np.minimum(p,np.subtract(1,p))) > eps):
        # each row of x is a random partition drawn from p
        x = np.array(np.random.uniform(0,1,(N,n))<=p, dtype=np.int64)
        sx = np.zeros(N)
        for i in range(N):
            sx[i] = S(x[i],D)

        sortSX = np.flip(np.argsort(sx))  # indices sorted from best to worst
        #print("gamma = ",sx[sortSX[Ne-1]], " best=",sx[sortSX[0]])
        elIds = sortSX[0:Ne]
        elites = x[elIds]
        pnew = np.mean(elites, axis=0)
        p = alpha*pnew + (1.0-alpha)*p  # smoothed update of p

    return np.round(p)

Xmat = genfromtxt('clusterdata.csv', delimiter=',')
n = Xmat.shape[0]
D = squareform(pdist(Xmat))
N = 1000
eps = 10**-2
rho = 0.1
alpha = 0.9

# CE
pout = maxcut(D,N,eps,rho, alpha)

cutval = S(pout,D)
print("cutvalue ",cutval)

#plot
V1 = np.where(pout==0)[0]
xblue = Xmat[V1]
V2 = np.where(pout==1)[0]
xred = Xmat[V2]
plt.scatter(xblue[:,0],xblue[:,1], c="blue")
plt.scatter(xred[:,0],xred[:,1], c="red")

cutvalue 4.625207676517948 
Figure 4.11: Division of the data in Figure 4.4 into two clusters, via the cross-entropy
method.

4.8 Principal Component Analysis (PCA) 
The main idea of principal component analysis (PCA) is to reduce the dimensionality of
a data set consisting of many variables. PCA is a feature reduction (or feature extraction)
mechanism that helps us to handle high-dimensional data with more features than is
convenient to interpret.
4.8.1 Motivation: Principal Axes of an Ellipsoid 
Consider a d-dimensional normal distribution with mean vector 0 and covariance matrix
Σ. The corresponding pdf (see (2.33)) is

$$f(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}}\; e^{-\frac{1}{2} x^\top \Sigma^{-1} x}, \quad x \in \mathbb{R}^d.$$


If we were to draw many iid samples from this pdf, the points would roughly have an
ellipsoid pattern, as illustrated in Figure 3.1, and correspond to the contours of f: sets of
points x such that $x^\top \Sigma^{-1} x = c$, for some c > 0. In particular, consider the ellipsoid

$$x^\top \Sigma^{-1} x = 1, \quad x \in \mathbb{R}^d. \tag{4.42}$$

Let $\Sigma = BB^\top$, where B is for example the (lower) Cholesky matrix. Then, as explained
in Example A.5, the ellipsoid (4.42) can also be viewed as the linear transformation of the
d-dimensional unit sphere via matrix B. Moreover, the principal axes of the ellipsoid can
be found via a singular value decomposition (SVD) of B (or Σ), see Section A.6.5 and
Example A.8. In particular, suppose that an SVD of B is

$$B = U D V^\top \quad (\text{note that an SVD of } \Sigma \text{ is then } U D^2 U^\top).$$


The columns of the matrix UD correspond to the principal axes of the ellipsoid, and the
relative magnitudes of the axes are given by the elements of the diagonal matrix D. If some
of these magnitudes are small compared to the others, a reduction in the dimension of the
space may be achieved by projecting each point $x \in \mathbb{R}^d$ onto the subspace spanned by the
main (say k < d) columns of U, the so-called principal components. Suppose without
loss of generality that the first k principal components are given by the first k columns of
U, and let $U_k$ be the corresponding d × k matrix.

With respect to the standard basis $\{e_i\}$, the vector $x = x_1 e_1 + \cdots + x_d e_d$ is represented by
the d-dimensional vector $[x_1,\ldots,x_d]^\top$. With respect to the orthonormal basis $\{u_i\}$ formed
by the columns of matrix U, the representation of x is $U^\top x$. Similarly, the projection of
any point x onto the subspace spanned by the first k principal vectors is represented by the
k-dimensional vector $U_k^\top x$, with respect to the orthonormal basis formed by the columns of
$U_k$. So, the idea is that if a point x lies close to its projection $U_k U_k^\top x$, we may represent it via
k numbers instead of d, using the combined features given by the k principal components.
See Section A.4 for a review of projections and orthonormal bases.


Example 4.10 (Principal Components) Consider the matrix

$$\Sigma = \begin{bmatrix} 14 & 8 & 3 \\ 8 & 5 & 2 \\ 3 & 2 & 1 \end{bmatrix},$$

which can be written as $\Sigma = BB^\top$, with

$$B = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix}.$$

Figure 4.12 depicts the ellipsoid $x^\top \Sigma^{-1} x = 1$, which can be obtained by linearly transforming
the points on the unit sphere by means of the matrix B. The principal axes and sizes of the
ellipsoid are found through a singular value decomposition $B = UDV^\top$, where U and D are

$$U = \begin{bmatrix} 0.8460 & 0.4828 & 0.2261 \\ 0.4973 & -0.5618 & -0.6611 \\ 0.1922 & -0.6718 & 0.7154 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} 4.4027 & 0 & 0 \\ 0 & 0.7187 & 0 \\ 0 & 0 & 0.3160 \end{bmatrix}.$$





The columns of U show the directions of the principal axes of the ellipsoid, and the
diagonal elements of D indicate the relative magnitudes of the principal axes. We see that
the first principal component is given by the first column of U, and the second principal
component by the second column of U.

The projection of the point $x = [1.052, 0.6648, 0.2271]^\top$ onto the 1-dimensional space
spanned by the first principal component $u_1 = [0.8460, 0.4972, 0.1922]^\top$ is $z = u_1 u_1^\top x =
[1.0696, 0.6287, 0.2429]^\top$. With respect to the basis vector $u_1$, z is represented by the
number $u_1^\top z = 1.2643$. That is, $z = 1.2643\, u_1$.


Figure 4.12: A “surfboard” ellipsoid where one principal axis is significantly larger than 
the other two. 
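
The numbers in Example 4.10 can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the authors' code; the variable names are assumptions. Note that the sign of each singular vector returned by an SVD routine is arbitrary, so the first column of U may come out negated, which does not affect the projection.

```python
import numpy as np

B = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])
Sigma = B @ B.T                      # covariance matrix of Example 4.10

U, d, Vt = np.linalg.svd(B)          # B = U diag(d) V^T
print(np.round(d, 4))                # diagonal of D (singular values of B)

u1 = U[:, 0]                         # first principal component (sign arbitrary)
x = np.array([1.052, 0.6648, 0.2271])
coord = u1 @ x                       # coordinate of the projection w.r.t. u1
z = coord * u1                       # projection u1 u1^T x (sign-invariant)
print(np.round(z, 4), round(abs(coord), 4))
```

Up to rounding, the printed singular values, projection, and coordinate agree with the values quoted in the example.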


4.8.2 PCA and Singular Value Decomposition (SVD) 


In the setting above, we did not consider any data set drawn from a multivariate pdf f. The
whole analysis rested on linear algebra. In principal component analysis (PCA) we start
with data $x_1,\ldots,x_n$, where each $x_i$ is d-dimensional. PCA does not require assumptions
about how the data were obtained, but to make the link with the previous section, we can
think of the data as iid draws from a multivariate normal pdf.

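Let us collect the data in a matrix X in the usual way; that is,

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}.$$

The matrix X will be the PCA’s input. Under this setting, the data consists of points in
d-dimensional space, and our goal is to present the data using n feature vectors of dimension
k < d.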

In accordance with the previous section, we assume that the underlying distribution of the
data has expectation vector 0. In practice, this means that before PCA is applied, the data
needs to be centered by subtracting the column mean in every column:

$$x'_{ij} = x_{ij} - \bar{x}_j,$$

where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij}$.



We assume from now on that the data comes from a general d-dimensional distribution
with mean vector 0 and some covariance matrix Σ. The covariance matrix Σ is by definition
equal to the expectation of the random matrix $XX^\top$ (with X here denoting a random vector
from this distribution), and can be estimated from the data $x_1,\ldots,x_n$ via the sample average

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^\top = \frac{1}{n} X^\top X.$$
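
As a quick numerical check of the centering step and the two equivalent forms of the covariance estimate, consider the following sketch. The simulated data, the seed, and the variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
X = rng.normal(size=(200, 3)) + [1.0, -2.0, 0.5]  # hypothetical raw data

Xc = X - X.mean(axis=0)                           # subtract each column mean
n = Xc.shape[0]

Sigma_hat = Xc.T @ Xc / n                         # matrix form (1/n) X^T X
Sigma_sum = sum(np.outer(x, x) for x in Xc) / n   # (1/n) sum_i x_i x_i^T

print(np.allclose(Sigma_hat, Sigma_sum))
# np.cov with bias=True also normalizes by n rather than n - 1
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))
```

Both comparisons should print True: the sum of outer products and the matrix product are the same quantity, and both coincide with the biased sample covariance of the uncentered data.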


As $\hat{\Sigma}$ is a covariance matrix, we may conduct the same analysis for $\hat{\Sigma}$ as we did for Σ in the
previous section. Specifically, suppose $\hat{\Sigma} = UD^2U^\top$ is an SVD of $\hat{\Sigma}$ and let $U_k$ be the matrix
whose columns are the k principal components; that is, the k columns of U corresponding
to the largest diagonal elements in $D^2$. Note that we have used $D^2$ instead of D to be
compatible with the previous section. The transformation $z_i = U_k U_k^\top x_i$ maps each vector $x_i \in \mathbb{R}^d$
(thus, with d features) to a vector $z_i \in \mathbb{R}^d$ lying in the subspace spanned by the columns of
$U_k$. With respect to this basis, the point $z_i$ has representation $z_i = U_k^\top (U_k U_k^\top x_i) = U_k^\top x_i \in \mathbb{R}^k$
(thus with k features). The corresponding covariance matrix of the $z_i$, $i = 1,\ldots,n$ is
diagonal. The diagonal elements $\{d_{\ell\ell}\}$ of D can be interpreted as standard deviations of the data
in the directions of the principal components. The quantity $v = \sum_{\ell=1}^d d_{\ell\ell}^2$ (that is, the trace of
$D^2$) is thus a measure for the amount of variance in the data. The proportion $d_{\ell\ell}^2/v$ indicates
how much of the variance in the data is explained by the ℓ-th principal component.
Another way to look at PCA is by considering the question: How can we best project the
data onto a k-dimensional subspace in such a way that the total squared distance between
the projected points and the original points is minimal? From Section A.4, we know that
any orthogonal projection to a k-dimensional subspace $V_k$ can be represented by a matrix
$U_k U_k^\top$, where $U_k = [u_1,\ldots,u_k]$ and the $\{u_\ell, \ell = 1,\ldots,k\}$ are orthogonal vectors of length 1
that span $V_k$. The above question can thus be formulated as the minimization program:

$$\min_{u_1,\ldots,u_k} \sum_{i=1}^n \| x_i - U_k U_k^\top x_i \|^2. \tag{4.43}$$

Now observe that

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n \| x_i - U_k U_k^\top x_i \|^2 &= \frac{1}{n} \sum_{i=1}^n (x_i^\top - x_i^\top U_k U_k^\top)(x_i - U_k U_k^\top x_i) \\
&= \underbrace{\frac{1}{n} \sum_{i=1}^n \| x_i \|^2}_{c} - \frac{1}{n} \sum_{i=1}^n x_i^\top U_k U_k^\top x_i = c - \frac{1}{n} \sum_{i=1}^n \sum_{\ell=1}^k \mathrm{tr}(x_i^\top u_\ell u_\ell^\top x_i) \\
&= c - \frac{1}{n} \sum_{\ell=1}^k \sum_{i=1}^n u_\ell^\top x_i x_i^\top u_\ell = c - \sum_{\ell=1}^k u_\ell^\top \hat{\Sigma} u_\ell,
\end{aligned}$$

where we have used the cyclic property of a trace (Theorem A.1) and the fact that $U_k U_k^\top$
can be written as $\sum_{\ell=1}^k u_\ell u_\ell^\top$. It follows that the minimization problem (4.43) is equivalent
to the maximization problem

$$\max_{u_1,\ldots,u_k} \sum_{\ell=1}^k u_\ell^\top \hat{\Sigma} u_\ell. \tag{4.44}$$

This maximum can be at most $\sum_{\ell=1}^k d_{\ell\ell}^2$ and is attained precisely when $u_1,\ldots,u_k$ are the
first k principal components of $\hat{\Sigma}$.
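
The equivalence between the minimization (4.43) and the maximization (4.44) can be checked numerically. The sketch below uses simulated data (an assumption for illustration) and compares the mean squared projection error of the first k principal components with that of a randomly chosen orthonormal basis; the helper name mean_sq_error is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n, d, k = 500, 4, 2
A = rng.normal(size=(d, d))                    # hypothetical mixing matrix
X = rng.normal(size=(n, d)) @ A.T
X = X - X.mean(axis=0)                         # center the data
Sigma_hat = X.T @ X / n

def mean_sq_error(Uk):
    """(1/n) sum_i ||x_i - Uk Uk^T x_i||^2 for a matrix Uk with orthonormal columns."""
    R = X - X @ Uk @ Uk.T
    return np.sum(R**2) / n

U, d2, _ = np.linalg.svd(Sigma_hat)            # Sigma_hat = U diag(d2) U^T
Uk = U[:, :k]                                  # first k principal components
c = np.sum(X**2) / n

lhs = mean_sq_error(Uk)
rhs = c - np.trace(Uk.T @ Sigma_hat @ Uk)      # c - sum_l u_l^T Sigma_hat u_l
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # random orthonormal competitor
print(lhs, rhs, mean_sq_error(Q))
```

The first two printed values coincide, and the error for the random basis Q is never smaller than the error for the principal components.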


Example 4.11 (Singular Value Decomposition) The following data set consists of
independent samples from the three-dimensional Gaussian distribution with mean vector 0
and covariance matrix Σ given in Example 4.10:

$$X = \begin{bmatrix}
3.1209 & 1.7438 & 0.5479 \\
-2.6628 & -1.5310 & -0.2763 \\
3.7284 & 3.0648 & 1.8451 \\
0.4203 & 0.3553 & 0.4268 \\
-0.7155 & -0.6871 & -0.1414 \\
5.8728 & 4.0180 & 1.4541 \\
4.8163 & 2.4799 & 0.5637 \\
2.6948 & 1.2384 & 0.1533 \\
-1.1376 & -0.4677 & -0.2219 \\
-1.2452 & -0.9942 & -0.4449
\end{bmatrix}.$$

After replacing X with its centered version, an SVD $UD^2U^\top$ of $\hat{\Sigma} = X^\top X/n$ yields the
principal component matrix U and diagonal matrix D:

$$U = \begin{bmatrix} -0.8277 & 0.4613 & 0.3195 \\ -0.5300 & -0.4556 & -0.7152 \\ -0.1843 & -0.7613 & 0.6216 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} 3.3424 & 0 & 0 \\ 0 & 0.4778 & 0 \\ 0 & 0 & 0.1038 \end{bmatrix}.$$

We also observe that, apart from the sign of the first column, the principal component
matrix U is similar to that in Example 4.10. Likewise for the matrix D. We see that 97.90%
of the total variance is explained by the first principal component. Figure 4.13 shows the
projection of the centered data onto the subspace spanned by this principal component.





Figure 4.13: Data from the “surfboard” pdf is projected onto the subspace spanned by the 
largest principal component. 


The following Python code was used.

PCAdat.py
import numpy as np
X = np.genfromtxt('pcadat.csv', delimiter=',')
n = X.shape[0]
X = X - X.mean(axis=0)
G = X.T @ X
U, _ , _ = np.linalg.svd(G/n)
# projected points
Y = X @ np.outer(U[:,0],U[:,0])
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.xaxis.set_pane_color((0, 0, 0, 0))
ax.plot(Y[:,0], Y[:,1], Y[:,2], c='k', linewidth=1)
ax.scatter(X[:,0], X[:,1], X[:,2], c='b')
ax.scatter(Y[:,0], Y[:,1], Y[:,2], c='r')
for i in range(n):
    ax.plot([X[i,0], Y[i,0]], [X[i,1],Y[i,1]], [X[i,2],Y[i,2]], 'b')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
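
The proportion of variance explained in Example 4.11 can be checked directly from the data listed in that example. The sketch below enters the data matrix by hand rather than reading pcadat.csv; the variable names are assumptions, and the result should match the quoted figures only up to rounding.

```python
import numpy as np

# Data matrix X as listed in Example 4.11
X = np.array([[ 3.1209,  1.7438,  0.5479],
              [-2.6628, -1.5310, -0.2763],
              [ 3.7284,  3.0648,  1.8451],
              [ 0.4203,  0.3553,  0.4268],
              [-0.7155, -0.6871, -0.1414],
              [ 5.8728,  4.0180,  1.4541],
              [ 4.8163,  2.4799,  0.5637],
              [ 2.6948,  1.2384,  0.1533],
              [-1.1376, -0.4677, -0.2219],
              [-1.2452, -0.9942, -0.4449]])
X = X - X.mean(axis=0)                  # center the columns
n = X.shape[0]

U, d2, _ = np.linalg.svd(X.T @ X / n)   # d2 holds the diagonal of D^2
D = np.sqrt(d2)                         # standard deviations along the components
explained = d2 / d2.sum()               # proportion of variance per component
print(np.round(D, 4))
print(np.round(explained, 4))
```

The first entry of explained is the share of the total variance captured by the first principal component.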
Next is an application of PCA to Fisher’s famous iris data set, already mentioned in
Section 1.1, and Exercise 1.5.

Example 4.12 (PCA for the Iris Data Set) The iris data set contains measurements
on four features of the iris plant: sepal length and width, and petal length and width, for a
total of 150 specimens. The full data set also contains the species name, but for the purpose
of this example we ignore it.
Figure 1.9 shows that there is a significant correlation between the different features.


Can we perhaps describe the data using fewer features by taking certain linear combin- 
ations of the original features? To investigate this, let us perform a PCA, first centering 
the data. The following Python code implements the PCA. It is assumed that a CSV file 
irisX.csv has been made that contains the iris data set (without the species information). 






import seaborn as sns, numpy as np
np.set_printoptions(precision=4)

X = np.genfromtxt('IrisX.csv',delimiter=',')
n = X.shape[0]
X = X - np.mean(X, axis=0)

[U,D2,UT] = np.linalg.svd((X.T @ X)/n)
print('U = \n', U); print('\n diag(D^2) = ', D2)

z = U[:,0].T @ X.T

# newer seaborn versions use bw_method instead of bw
sns.kdeplot(z, bw=0.15)


[[-0.3614 -0.6566  0.582   0.3155]
 [ 0.0845 -0.7302 -0.5979 -0.3197]
 [-0.8567  0.1734 -0.0762 -0.4798]
 [-0.3583  0.0755 -0.5458  0.7537]]

diag(D^2) = [4.2001 0.2411 0.0777 0.0237]


The output above shows the principal component matrix (which we called U) as well as
the diagonal of matrix $D^2$. We see that a large proportion of the variance, 4.2001/(4.2001 +
0.2411 + 0.0777 + 0.0237) = 92.46%, is explained by the first principal component. Thus, it
makes sense to transform each data point $x \in \mathbb{R}^4$ to $u_1^\top x \in \mathbb{R}$. Figure 4.14 shows the kernel
density estimate of the transformed data. Interestingly, we see two modes, indicating at
least two clusters in the data.


[Plot: kernel density estimate (vertical axis) against PCA-combined data (horizontal axis).]


Figure 4.14: Kernel density estimate of the PCA-combined iris data. 


Further Reading 


Various information-theoretic measures to quantify uncertainty, including the Shannon en- 
tropy and Kullback—Leibler divergence, may be found in [28]. The Fisher information, the 
prominent information measure in statistics, is discussed in detail in [78]. Akaike’s inform- 
ation criterion appeared in [2]. The EM algorithm was introduced in [31] and [85] gives an 
in-depth treatment. Convergence proofs for the EM algorithm may be found in [19, 128]. 
A classical reference on kernel density estimation is [113], and [14] is the main reference 
for the theta kernel density estimator. Theory and applications on finite mixture models 
may be found in [86]. For more details on clustering applications and algorithms as well 
as references on data compression, vector quantization, and pattern recognition, we refer 
to [1, 35, 107, 125]. A useful modification of the K-means algorithm is the fuzzy K-means
algorithm; see, e.g., [9]. A popular way to choose the starting positions in K-means is given 
by the K-means++ heuristic, introduced in [4]. 


Exercises 


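1. This exercise is to show that the Fisher information matrix F(θ) in (4.8) is equal to the
matrix H(θ) in (4.9), in the special case where f = g(· | θ), and under the assumption that
integration and differentiation orders can be interchanged.

(a) Let h be a vector-valued function and k a real-valued function. Prove the following
quotient rule for differentiation:

$$\frac{\partial [h(\theta)/k(\theta)]}{\partial \theta} = \frac{1}{k(\theta)} \frac{\partial h(\theta)}{\partial \theta} - \frac{1}{k^2(\theta)} \frac{\partial k(\theta)}{\partial \theta}\, h(\theta)^\top. \tag{4.45}$$

(b) Now take $h(\theta) = \frac{\partial g(X\,|\,\theta)}{\partial \theta}$ and $k(\theta) = g(X\,|\,\theta)$ in (4.45) and take expectations with
respect to $E_\theta$ on both sides to show that

$$-H(\theta) = \underbrace{E_\theta \left[ \frac{1}{g(X\,|\,\theta)} \frac{\partial \frac{\partial g(X\,|\,\theta)}{\partial \theta}}{\partial \theta} \right]}_{A} - F(\theta).$$

(c) Finally show that A is the zero matrix.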


2. Plot the mixture of N(0, 1), U(0, 1), and Exp(1) distributions, with weights $w_1 = w_2 =
w_3 = 1/3$.


3. Denote the pdfs in Exercise 2 by $f_1, f_2, f_3$, respectively. Suppose that X is simulated via
the two-step procedure: First, draw Z from {1, 2, 3}, then draw X from $f_Z$. How likely is it
that the outcome x = 0.5 of X has come from the uniform pdf $f_2$?


4. Simulate an iid training set of size 100 from the Gamma(2.3,0.5) distribution, and 
implement the Fisher scoring method in Example 4.1 to find the maximum likelihood es- 
timate. Plot the true and approximate pdfs. 


5. Let $\mathcal{T} = \{X_1,\ldots,X_n\}$ be iid data from a pdf $g(x\,|\,\theta)$ with Fisher matrix F(θ). Explain
why, under the conditions where (4.7) holds,

$$S_{\mathcal{T}}(\theta) := \frac{1}{\sqrt{n}} \sum_{i=1}^n S(X_i\,|\,\theta)$$

for large n has approximately a multivariate normal distribution with expectation vector 0
and covariance matrix F(θ).


6. Figure 4.15 shows a Gaussian KDE with bandwidth σ = 0.2 on the points −0.5, 0,
0.2, 0.9, and 1.5. Reproduce the plot in Python. Using the same bandwidth, plot also the
KDE for the same data, but now with $\phi(z) = 1/2$, $z \in [-1, 1]$.

Figure 4.15: The Gaussian KDE (solid line) is the equally weighted mixture of normal pdfs
centered around the data and with standard deviation σ = 0.2 (dashed).


7. For fixed x′, the Gaussian kernel function

$$f(x\,|\,t) := \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{(x - x')^2}{2t}}$$

is the solution to Fourier’s heat equation

$$\frac{\partial}{\partial t} f(x\,|\,t) = \frac{1}{2} \frac{\partial^2}{\partial x^2} f(x\,|\,t), \quad x \in \mathbb{R},\; t > 0,$$

with initial condition $f(x\,|\,0) = \delta(x - x')$ (the Dirac function at x′). Show this. As a
consequence, the Gaussian KDE is the solution to the same heat equation, but now with initial
condition $f(x\,|\,0) = n^{-1} \sum_{i=1}^n \delta(x - x_i)$. This was the motivation for the theta KDE [14],
which is a solution to the same heat equation but now on a bounded interval.


8. Show that the Ward linkage given in (4.41) is equal to

$$d_{\mathrm{Ward}}(\mathcal{I}, \mathcal{J}) = \frac{|\mathcal{I}|\,|\mathcal{J}|}{|\mathcal{I}| + |\mathcal{J}|}\, \| \bar{x}_{\mathcal{I}} - \bar{x}_{\mathcal{J}} \|^2.$$


9. Carry out the agglomerative hierarchical clustering of Example 4.8 via the linkage
method from scipy.cluster.hierarchy. Show that the linkage matrices are the same.
Give a scatterplot of the data, color coded into K = 3 clusters.


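10. Suppose that we have the data $\tau_n = \{x_1,\ldots,x_n\}$ in ℝ and decide to train the
two-component Gaussian mixture model

$$g(x\,|\,\theta) = w_1 \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) + w_2 \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right),$$

where the parameter vector $\theta = [\mu_1, \mu_2, \sigma_1, \sigma_2, w_1, w_2]^\top$ belongs to the set

$$\Theta = \{\theta : w_1 + w_2 = 1,\; w_i \in [0, 1],\; \mu_i \in \mathbb{R},\; \sigma_i > 0,\; \forall i\}.$$

Suppose that the training is via the maximum likelihood in (2.28). Show that

$$\sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ln g(x_i\,|\,\theta) = \infty.$$

In other words, find a sequence of values for θ ∈ Θ such that the likelihood grows without
bound. How can we restrict the set Θ to ensure that the likelihood remains bounded?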
11. A d-dimensional normal random vector X ~ N(μ, Σ) can be defined via an affine
transformation, $X = \mu + \Sigma^{1/2} Z$, of a standard normal random vector $Z \sim N(0, I_d)$, where
$\Sigma^{1/2} (\Sigma^{1/2})^\top = \Sigma$. In a similar way, we can define a d-dimensional Student random vector
$X \sim t_\alpha(\mu, \Sigma)$ via a transformation

$$X = \mu + \frac{1}{\sqrt{S}}\, \Sigma^{1/2} Z, \tag{4.46}$$

where $Z \sim N(0, I_d)$ and $S \sim \mathrm{Gamma}(\frac{\alpha}{2}, \frac{\alpha}{2})$ are independent, α > 0, and $\Sigma^{1/2} (\Sigma^{1/2})^\top = \Sigma$.
Note that we obtain the multivariate normal distribution as a limiting case for α → ∞.


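(a) Show that the density of the $t_\alpha(0, I_d)$ distribution is given by

$$t_\alpha(x) := \frac{\Gamma((\alpha + d)/2)}{(\pi\alpha)^{d/2}\, \Gamma(\alpha/2)} \left( 1 + \frac{\|x\|^2}{\alpha} \right)^{-\frac{\alpha + d}{2}}.$$

By the transformation rule (C.23), it follows that the density of $X \sim t_\alpha(\mu, \Sigma)$ is given
by $t_{\alpha,\Sigma}(x - \mu)$, where

$$t_{\alpha,\Sigma}(x) := \frac{1}{|\Sigma^{1/2}|}\, t_\alpha(\Sigma^{-1/2} x).$$

[Hint: conditional on S = s, X has a $N(0, I_d/s)$ distribution.]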
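(b) We wish to fit a $t_\alpha(\mu, \Sigma)$ distribution to given data $\tau = \{x_1,\ldots,x_n\}$ in $\mathbb{R}^d$ via the EM
method. We use the representation (4.46) and augment the data with the vector $S =
[S_1,\ldots,S_n]^\top$ of hidden variables. Show that the complete-data likelihood is given by

$$g(\tau, s\,|\,\theta) = \prod_{i=1}^n \frac{(\alpha/2)^{\alpha/2}\, s_i^{(\alpha + d)/2 - 1} \exp\left( -\frac{s_i}{2}\alpha - \frac{s_i}{2} \|\Sigma^{-1/2}(x_i - \mu)\|^2 \right)}{\Gamma(\alpha/2)\, (2\pi)^{d/2}\, |\Sigma^{1/2}|}. \tag{4.47}$$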


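(c) Show that, as a consequence, conditional on the data τ and parameter θ, the hidden
data are mutually independent, and

$$(S_i\,|\,\tau, \theta) \sim \mathrm{Gamma}\left( \frac{\alpha + d}{2},\; \frac{\alpha + \|\Sigma^{-1/2}(x_i - \mu)\|^2}{2} \right), \quad i = 1,\ldots,n.$$
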

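(d) At iteration t of the EM algorithm, let $g^{(t)}(s) = g(s\,|\,\tau, \theta^{(t-1)})$ be the density of the
missing data, given the observed data τ and the current parameter guess $\theta^{(t-1)}$. Verify
that the expected complete-data log-likelihood is given by:

$$\begin{aligned}
E_{g^{(t)}} \ln g(\tau, S\,|\,\theta) = {} & \frac{n\alpha}{2} \ln\frac{\alpha}{2} - \frac{nd}{2} \ln(2\pi) - n \ln \Gamma\left(\frac{\alpha}{2}\right) - \frac{n}{2} \ln |\Sigma| \\
& + \frac{\alpha + d - 2}{2} \sum_{i=1}^n E_{g^{(t)}} \ln S_i - \sum_{i=1}^n \frac{\alpha + \|\Sigma^{-1/2}(x_i - \mu)\|^2}{2}\, E_{g^{(t)}} S_i.
\end{aligned}$$

Show that

$$E_{g^{(t)}} S_i = \frac{\alpha^{(t-1)} + d}{\alpha^{(t-1)} + \|(\Sigma^{(t-1)})^{-1/2}(x_i - \mu^{(t-1)})\|^2} =: w_i^{(t-1)},$$

$$E_{g^{(t)}} \ln S_i = \psi\left( \frac{\alpha^{(t-1)} + d}{2} \right) - \ln\left( \frac{\alpha^{(t-1)} + d}{2} \right) + \ln w_i^{(t-1)},$$

where $\psi := (\ln \Gamma)'$ is the digamma function.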
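(e) Finally, show that in the M-step of the EM algorithm $\theta^{(t)}$ is updated from $\theta^{(t-1)}$ as
follows:

$$\mu^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)} x_i}{\sum_{i=1}^n w_i^{(t-1)}},$$

$$\Sigma^{(t)} = \frac{1}{n} \sum_{i=1}^n w_i^{(t-1)} (x_i - \mu^{(t)})(x_i - \mu^{(t)})^\top,$$

and $\alpha^{(t)}$ is defined implicitly through the solution of the nonlinear equation:

$$\ln\left( \frac{\alpha}{2} \right) - \psi\left( \frac{\alpha}{2} \right) + \psi\left( \frac{\alpha^{(t-1)} + d}{2} \right) - \ln\left( \frac{\alpha^{(t-1)} + d}{2} \right) + 1 + \frac{\sum_{i=1}^n \left( \ln w_i^{(t-1)} - w_i^{(t-1)} \right)}{n} = 0.$$
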


12. A generalization of both the gamma and inverse-gamma distribution is the generalized
inverse-gamma distribution, which has density

$$f(s) = \frac{(a/b)^{p/2}}{2 K_p(\sqrt{ab})}\, s^{p-1}\, e^{-\frac{1}{2}(as + b/s)}, \quad a, b, s > 0,\; p \in \mathbb{R}, \tag{4.48}$$

where $K_p$ is the modified Bessel function of the second kind, which can be defined as the
integral

$$K_p(x) = \int_0^\infty e^{-x \cosh(t)} \cosh(pt)\, \mathrm{d}t, \quad x > 0,\; p \in \mathbb{R}. \tag{4.49}$$

We write S ~ GIG(a, b, p) to denote that S has a pdf of the form (4.48). The function $K_p$
has many interesting properties. Special cases include

$$K_{1/2}(x) = \sqrt{\frac{\pi}{2x}}\, e^{-x},$$
$$K_{3/2}(x) = \sqrt{\frac{\pi}{2x}}\, e^{-x} \left( \frac{1}{x} + 1 \right),$$
$$K_{5/2}(x) = \sqrt{\frac{\pi}{2x}}\, e^{-x} \left( \frac{3}{x^2} + \frac{3}{x} + 1 \right).$$

More generally, $K_p$ satisfies the recursion

$$K_{p+1}(x) = K_{p-1}(x) + \frac{2p}{x} K_p(x). \tag{4.50}$$

(a) Using the change of variables $e^z = s\sqrt{a/b}$, show that

$$\int_0^\infty s^{p-1} e^{-\frac{1}{2}(as + b/s)}\, \mathrm{d}s = 2 K_p(\sqrt{ab})\, (b/a)^{p/2}.$$

(b) Let S ~ GIG(a, b, p). Show that

$$E S = \frac{\sqrt{b}\, K_{p+1}(\sqrt{ab})}{\sqrt{a}\, K_p(\sqrt{ab})} \tag{4.51}$$

and

$$E S^{-1} = \frac{\sqrt{a}\, K_{p+1}(\sqrt{ab})}{\sqrt{b}\, K_p(\sqrt{ab})} - \frac{2p}{b}. \tag{4.52}$$

13. In Exercise 11 we viewed the multivariate Student $t_\alpha$ distribution as a scale-mixture
of the $N(0, I_d)$ distribution. In this exercise, we consider a similar transformation, but now
$\Sigma^{1/2} Z \sim N(0, \Sigma)$ is not divided but is multiplied by $\sqrt{S}$, with $S \sim \mathrm{Gamma}(\alpha/2, \alpha/2)$:

$$X = \mu + \sqrt{S}\, \Sigma^{1/2} Z, \tag{4.53}$$

where S and Z are independent and α > 0.


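(a) Show, using Exercise 12, that for $\Sigma^{1/2} = I_d$ and μ = 0, the random vector X has a
d-dimensional Bessel distribution, with density:

$$\kappa_\alpha(x) := \frac{2^{1 - (\alpha + d)/2}\, \alpha^{(\alpha + d)/4}\, \|x\|^{(\alpha - d)/2}}{\pi^{d/2}\, \Gamma(\alpha/2)}\, K_{(\alpha - d)/2}\left( \|x\| \sqrt{\alpha} \right), \quad x \in \mathbb{R}^d,$$

where $K_p$ is the modified Bessel function of the second kind given in (4.49). We write
$X \sim \mathrm{Bessel}_\alpha(0, I_d)$. A random vector X is said to have a $\mathrm{Bessel}_\alpha(\mu, \Sigma)$ distribution if
it can be written in the form (4.53). By the transformation rule (C.23), its density is
given by $\frac{1}{\sqrt{|\Sigma|}}\, \kappa_\alpha(\Sigma^{-1/2}(x - \mu))$. Special instances of the Bessel pdf include:

$$\kappa_2(x) = \frac{\exp(-\sqrt{2}\, |x|)}{\sqrt{2}},$$
$$\kappa_4(x) = \frac{1 + 2|x|}{2} \exp(-2|x|),$$
$$\kappa_4(x_1, x_2, x_3) = \frac{1}{\pi} \exp\left( -2\sqrt{x_1^2 + x_2^2 + x_3^2} \right),$$
$$\kappa_{d+1}(x) = \frac{((d+1)/2)^{d/2}\, \sqrt{\pi}}{(2\pi)^{d/2}\, \Gamma((d+1)/2)} \exp\left( -\sqrt{d+1}\, \|x\| \right), \quad x \in \mathbb{R}^d.$$

Note that $\kappa_2$ is the (scaled) pdf of the double-exponential or Laplace distribution.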


(b) Given the data $\tau = \{x_1,\ldots,x_n\}$ in $\mathbb{R}^d$, we wish to fit a Bessel pdf to the data by
employing the EM algorithm, augmenting the data with the vector $S = [S_1,\ldots,S_n]^\top$
of missing data. We assume that α is known and α > d. Show that conditional on
τ (and given θ), the missing data vector S has independent components, with $S_i \sim
\mathrm{GIG}(\alpha, b_i, (\alpha - d)/2)$, with $b_i := \|\Sigma^{-1/2}(x_i - \mu)\|^2$, i = 1,...,n.

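(c) At iteration t of the EM algorithm, let $g^{(t)}(s) = g(s\,|\,\tau, \theta^{(t-1)})$ be the density of the
missing data, given the observed data τ and the current parameter guess $\theta^{(t-1)}$. Show
that the expected complete-data log-likelihood is given by:

$$Q^{(t)}(\theta) := E_{g^{(t)}} \ln g(\tau, S\,|\,\theta) = -\frac{1}{2} \sum_{i=1}^n b_i(\theta)\, w_i^{(t-1)} + \text{constant}, \tag{4.54}$$

where $b_i(\theta) = \|\Sigma^{-1/2}(x_i - \mu)\|^2$ and

$$w_i^{(t-1)} := \frac{\sqrt{\alpha}\, K_{(\alpha - d + 2)/2}\left( \sqrt{\alpha\, b_i(\theta^{(t-1)})} \right)}{\sqrt{b_i(\theta^{(t-1)})}\, K_{(\alpha - d)/2}\left( \sqrt{\alpha\, b_i(\theta^{(t-1)})} \right)} - \frac{\alpha - d}{b_i(\theta^{(t-1)})}.$$

From (4.54) derive the M-step of the EM algorithm. That is, show how $\theta^{(t)}$ is updated
from $\theta^{(t-1)}$.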



14. Consider the ellipsoid $E = \{x \in \mathbb{R}^d : x^\top \Sigma^{-1} x = 1\}$ in (4.42). Let $UD^2U^\top$ be an SVD of
Σ. Show that the linear transformation $x \mapsto D^{-1} U^\top x$ maps the points on E onto the unit
sphere $\{z \in \mathbb{R}^d : \|z\| = 1\}$.


15. Figure 4.13 shows how the centered “surfboard” data are projected onto the first
column of the principal component matrix U. Suppose we project the data instead onto
the plane spanned by the first two columns of U. What are a and b in the representation
$a x_1 + b x_2 = x_3$ of this plane?


16. Figure 4.14 suggests that we can assign each feature vector x in the iris data set to
one of two clusters, based on the value of $u_1^\top x$, where $u_1$ is the first principal component.
Plot the sepal lengths against petal lengths and color the points for which $u_1^\top x < 1.5$
differently to points for which $u_1^\top x \geq 1.5$. To which species of iris do these clusters correspond?
CHAPTER 5





REGRESSION 





Many supervised learning techniques can be gathered under the name “regression”. 
The purpose of this chapter is to explain the mathematical ideas behind regression 
models and their practical aspects. We analyze the fundamental linear model in detail, 
and also discuss nonlinear and generalized linear models. 


5.1 Introduction 


Francis Galton observed in an article in 1889 that the heights of adult offspring are, on the 
whole, more “average” than the heights of their parents. Galton interpreted this as a degen- 
erative phenomenon, using the term “regression” to indicate this “return to mediocrity”. 
Nowadays, regression refers to a broad class of supervised learning techniques where the 
aim is to predict a quantitative response (output) variable y via a function g(x) of an
explanatory (input) vector $x = [x_1,\ldots,x_p]^\top$, consisting of p features, each of which can be
continuous or discrete. For instance, regression could be used to predict the birth weight of 
a baby (the response variable) from the weight of the mother, her socio-economic status, 
and her smoking habits (the explanatory variables). 

Let us recapitulate the framework of supervised learning established in Chapter 2. The
aim is to find a prediction function g that best guesses¹ what the random output Y will be
for a random input vector X. The joint pdf f(x, y) of X and Y is unknown, but a training
set $\tau = \{(x_1, y_1),\ldots,(x_n, y_n)\}$ is available, which is thought of as the outcome of a random
training set $\mathcal{T} = \{(X_1, Y_1),\ldots,(X_n, Y_n)\}$ of iid copies of (X, Y). Once we have selected a
loss function $\mathrm{Loss}(y, \hat{y})$, such as the squared-error loss

$$\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2, \tag{5.1}$$

then the “best” prediction function g is defined as the one that minimizes the risk $\ell(g) =
E\, \mathrm{Loss}(Y, g(X))$. We saw in Section 2.2 that for the squared-error loss this optimal
prediction function is the conditional expectation

$$g^*(x) = E[Y\,|\,X = x].$$

¹ Recall the mnemonic use of “g” for “guess”.



As the squared-error loss is the most widely-used loss function for regression, we will
adopt this loss function in most of this chapter.

The optimal prediction function $g^*$ has to be learned from the training set τ by
minimizing the training loss

$$\ell_\tau(g) = \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2 \tag{5.2}$$

over a suitable class of functions $\mathcal{G}$. Note that in the above definition, the training set τ is
assumed to be fixed. For a random training set $\mathcal{T}$, we will write the training loss as $\ell_{\mathcal{T}}(g)$.
The function $g_\tau^{\mathcal{G}}$ that minimizes the training loss is the function we use for prediction,
the so-called learner. When the function class $\mathcal{G}$ is clear from the context, we drop the
superscript in the notation.
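
As a small illustration of the training loss (5.2), the sketch below evaluates it for two candidate prediction functions on a tiny training set. The data, the two candidates, and the function name training_loss are assumptions made for illustration only.

```python
import numpy as np

def training_loss(g, x, y):
    """Average squared-error loss (5.2) of prediction function g on (x, y)."""
    return np.mean((y - g(x))**2)

# Hypothetical training set and two candidate prediction functions.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def g1(x):
    return 2.0 * x + 1.0               # a line that passes through all points

def g2(x):
    return np.full_like(x, y.mean())   # constant predictor at the mean of y

print(training_loss(g1, x, y), training_loss(g2, x, y))
```

Within a class containing both candidates, the learner would prefer the first function, because its training loss is smaller.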

As we already saw in (2.2), conditional on X = x, the response Y can be written as

$$Y = g^*(x) + \varepsilon(x),$$

where $E\, \varepsilon(x) = 0$. This motivates a standard modeling assumption in supervised
learning, in which the responses $Y_1,\ldots,Y_n$, conditional on the explanatory variables $X_1 =
x_1,\ldots,X_n = x_n$, are assumed to be of the form

$$Y_i = g(x_i) + \varepsilon_i, \quad i = 1,\ldots,n,$$

where the $\{\varepsilon_i\}$ are independent with $E\, \varepsilon_i = 0$ and $\mathrm{Var}\, \varepsilon_i = \sigma^2$ for some function $g \in \mathcal{G}$ and
variance $\sigma^2$. The above model is usually further specified by assuming that g is completely
known up to an unknown parameter vector; that is,

$$Y_i = g(x_i\,|\,\beta) + \varepsilon_i, \quad i = 1,\ldots,n. \tag{5.3}$$

While the model (5.3) is described conditional on the explanatory variables, it will be
convenient to make one further model simplification, and view (5.3) as if the $\{x_i\}$ were
fixed, while the $\{Y_i\}$ are random.

For the remainder of this chapter, we assume that the training feature vectors $\{x_i\}$ are
fixed and only the responses are random; that is, $\mathcal{T} = \{(x_1, Y_1),\ldots,(x_n, Y_n)\}$.





The advantage of the model (5.3) is that the problem of estimating the function g from
the training data is reduced to the (much simpler) problem of estimating the parameter
vector β. An obvious disadvantage is that functions of the form $g(\cdot\,|\,\beta)$ may not accurately
approximate the true unknown $g^*$. The remainder of this chapter deals with the analysis
of models of the form (5.3). In the important case where the function $g(\cdot\,|\,\beta)$ is linear, the
analysis proceeds through the class of linear models. If, in addition, the error terms $\{\varepsilon_i\}$ are
assumed to be Gaussian, this analysis can be carried out using the rich theory of normal
linear models.

5.2 Linear Regression


The most basic regression model involves a linear relationship between the response and a
single explanatory variable. In particular, we have measurements $(x_1, y_1),\ldots,(x_n, y_n)$ that
lie approximately on a straight line, as in Figure 5.1.


Figure 5.1: Data from a simple linear regression model. 


Following the general scheme captured in (5.3), a simple model for these data is that
the $\{x_i\}$ are fixed and variables $\{Y_i\}$ are random such that

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1,\ldots,n, \tag{5.4}$$

for certain unknown parameters $\beta_0$ and $\beta_1$. The $\{\varepsilon_i\}$ are assumed to be independent with
expectation 0 and unknown variance $\sigma^2$. The unknown line

$$y = \underbrace{\beta_0 + \beta_1 x}_{g(x\,|\,\beta)} \tag{5.5}$$

is called the regression line. Thus, we view the responses as random variables that would
lie exactly on the regression line, were it not for some “disturbance” or “error” term
represented by the $\{\varepsilon_i\}$. The extent of the disturbance is modeled by the parameter $\sigma^2$. The model
in (5.4) is called simple linear regression. This model can easily be extended to incorporate
more than one explanatory variable, as follows.

Definition 5.1: Multiple Linear Regression Model
In a multiple linear regression model the response Y depends on a d-dimensional
explanatory vector $x = [x_1,\ldots,x_d]^\top$, via the linear relationship

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d + \varepsilon, \tag{5.6}$$

where $E\, \varepsilon = 0$ and $\mathrm{Var}\, \varepsilon = \sigma^2$.



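Thus, the data lie approximately on a $d$-dimensional affine hyperplane

$$y = \underbrace{\beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d}_{g(\mathbf{x} \mid \boldsymbol{\beta})},$$

where we define $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_d]^\top$. The function $g(\mathbf{x} \mid \boldsymbol{\beta})$ is linear in $\boldsymbol{\beta}$, but not linear in
the feature vector $\mathbf{x}$, due to the constant $\beta_0$. However, augmenting the feature space with
the constant 1, the mapping $[1, \mathbf{x}^\top]^\top \mapsto g(\mathbf{x} \mid \boldsymbol{\beta}) := [1, \mathbf{x}^\top]\,\boldsymbol{\beta}$ becomes linear in the feature
space and so (5.6) becomes a linear model (see Section 2.1). Most software packages for
regression include 1 as a feature by default.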

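Note that in (5.6) we only specified the model for a single pair $(\mathbf{x}, Y)$. The model for the
training set $\mathcal{T} = \{(\mathbf{x}_1, Y_1), \ldots, (\mathbf{x}_n, Y_n)\}$ is simply that each $Y_i$ satisfies (5.6) (with $\mathbf{x} = \mathbf{x}_i$)
and that the $\{Y_i\}$ are independent. Setting $\mathbf{Y} = [Y_1, \ldots, Y_n]^\top$, we can write the multiple
linear regression model for the training data compactly as

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{5.7}$$

where $\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_n]^\top$ is a vector of iid copies of $\varepsilon$ and $\mathbf{X}$ is the model matrix given by

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} = \begin{bmatrix} 1 & \mathbf{x}_1^\top \\ 1 & \mathbf{x}_2^\top \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^\top \end{bmatrix}.$$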


Example 5.1 (Multiple Linear Regression Model) Figure 5.2 depicts a realization of
the multiple linear regression model

$$Y_i = x_{i1} + x_{i2} + \varepsilon_i, \quad i = 1, \ldots, 100,$$

where $\varepsilon_1, \ldots, \varepsilon_{100} \sim_{\text{iid}} \mathcal{N}(0, 1/16)$. The fixed feature vectors (vectors of explanatory
variables) $\mathbf{x}_i = [x_{i1}, x_{i2}]^\top$, $i = 1, \ldots, 100$, lie in the unit square.


Figure 5.2: Data from a multiple linear regression model.
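
A realization such as the one in Example 5.1 can be simulated in a few lines. The following is a minimal sketch, not the code used to produce Figure 5.2: the seed, the use of NumPy's `default_rng`, and the uniform placement of the feature vectors in the unit square are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily (assumption)
n, d = 100, 2
x = rng.random((n, d))                    # fixed feature vectors in the unit square
beta = np.array([0.0, 1.0, 1.0])          # [beta_0, beta_1, beta_2] for Y = x1 + x2 + eps
X = np.hstack((np.ones((n, 1)), x))       # model matrix: leading column of ones, then features
eps = rng.normal(0.0, 0.25, size=n)       # standard deviation 1/4, i.e. variance 1/16
Y = X @ beta + eps                        # responses according to (5.7)
print(X.shape, Y.shape)
```

Note that the constant 1 is added as the first column of the model matrix, in line with the augmented feature vector discussed after Definition 5.1.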
5.3 Analysis via Linear Models


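Analysis of data from a linear regression model is greatly simplified through the linear
model representation (5.7). In this section we present the main ideas for parameter estimation
and model selection for a general linear model of the form

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{5.8}$$

where $\mathbf{X}$ is an $n \times p$ matrix, $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_p]^\top$ a vector of $p$ parameters, and $\boldsymbol{\varepsilon} = [\varepsilon_1, \ldots, \varepsilon_n]^\top$
an $n$-dimensional vector of independent error terms, with $\mathbb{E}\,\varepsilon_i = 0$ and $\operatorname{Var}\,\varepsilon_i = \sigma^2$, $i = 1, \ldots, n$.
Note that the model matrix $\mathbf{X}$ is assumed to be fixed, and $\mathbf{Y}$ and $\boldsymbol{\varepsilon}$ are random. A
specific outcome of $\mathbf{Y}$ is denoted by $\mathbf{y}$ (in accordance with the notation in Section 2.8).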


Note that the multiple linear regression model in (5.7) was defined using a different
parameterization; in particular, there we used $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_d]^\top$. So, when applying
the results in the present section to such models, be aware that $p = d + 1$. Also,
in this section a feature vector $\mathbf{x}$ includes the constant 1, so that $\mathbf{X}^\top = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$.





5.3.1 Parameter Estimation 


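The linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ contains two unknown parameters, $\boldsymbol{\beta}$ and $\sigma^2$, which have
to be estimated from the training data $\tau$. To estimate $\boldsymbol{\beta}$, we can repeat exactly the same
reasoning used in our recurring polynomial regression Example 2.1 as follows. For a linear
prediction function $g(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}$, the (squared-error) training loss can be written as

$$\ell_\tau(g) = \frac{1}{n}\, \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2,$$

and the optimal learner $g_\tau$ minimizes this quantity, leading to the least-squares estimate $\widehat{\boldsymbol{\beta}}$,
which satisfies the normal equations

$$\mathbf{X}^\top \mathbf{X}\, \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}. \tag{5.9}$$

The corresponding training loss can be taken as an estimate of $\sigma^2$; that is,

$$\widehat{\sigma^2} = \frac{1}{n}\, \|\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2. \tag{5.10}$$

To justify the latter, note that $\sigma^2$ is the second moment of the model errors $\varepsilon_i$, $i = 1, \ldots, n$,
in (5.8) and could be estimated via the method of moments (see Section C.12.1) using the
sample average $n^{-1} \sum_i \varepsilon_i^2 = \|\boldsymbol{\varepsilon}\|^2/n = \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2/n$, if $\boldsymbol{\beta}$ were known. By replacing $\boldsymbol{\beta}$ with
its estimator, we arrive at (5.10). Note that no distributional properties of the $\{\varepsilon_i\}$ were used
other than $\mathbb{E}\,\varepsilon_i = 0$ and $\operatorname{Var}\,\varepsilon_i = \sigma^2$, $i = 1, \ldots, n$. The vector $\mathbf{e} := \mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}$ is called the
vector of residuals and approximates the (unknown) vector of model errors $\boldsymbol{\varepsilon}$. The quantity
$\|\mathbf{e}\|^2 = \sum_i e_i^2$ is called the residual sum of squares (RSS). Dividing the RSS by $n - p$ gives
an unbiased estimate of $\sigma^2$, which we call the estimated residual squared error (RSE); see
Exercise 12.

In terms of the notation given in the summary Table 2.1 for supervised learning, we
thus have:

1. The (observed) training data is $\tau = \{\mathbf{X}, \mathbf{y}\}$.

2. The function class $\mathcal{G}$ is the class of linear functions of $\mathbf{x}$; that is, $\mathcal{G} = \{g(\cdot \mid \boldsymbol{\beta}) : \mathbf{x} \mapsto \mathbf{x}^\top \boldsymbol{\beta},\ \boldsymbol{\beta} \in \mathbb{R}^p\}$.

3. The (squared-error) training loss is $\ell_\tau(g(\cdot \mid \boldsymbol{\beta})) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2/n$.

4. The learner $g_\tau$ is given by $g_\tau(\mathbf{x}) = \mathbf{x}^\top \widehat{\boldsymbol{\beta}}$, where $\widehat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^p} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$.

5. The minimal training loss is $\ell_\tau(g_\tau) = \|\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2/n = \widehat{\sigma^2}$.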
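
The estimation steps above can be sketched in code. This is an illustrative sketch under our own assumptions: the helper name `fit_linear_model` and the simulated data are hypothetical, and the least-squares problem is solved with `numpy.linalg.lstsq` rather than by forming the normal equations (5.9) explicitly.

```python
import numpy as np

def fit_linear_model(X, y):
    """Return the least-squares estimate, the training-loss estimate (5.10) of sigma^2,
    and the RSE (RSS divided by n - p)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||y - X beta||^2
    e = y - X @ beta_hat                               # vector of residuals
    rss = float(e @ e)                                 # residual sum of squares
    return beta_hat, rss / n, rss / (n - p)

# Simulated data (hypothetical parameter values, for illustration only)
rng = np.random.default_rng(1)
n = 200
X = np.hstack((np.ones((n, 1)), rng.random((n, 2))))
beta = np.array([1.0, 2.0, -3.0])
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_hat, sigma2_hat, rse = fit_linear_model(X, y)
print(beta_hat, sigma2_hat, rse)
```

The two variance estimates differ only by the factor $n/(n-p)$, so for large $n$ and small $p$ they are nearly identical.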


5.3.2 Model Selection and Prediction 


Even if we restrict the learner to be a linear function, there is still the issue of which explan- 
atory variables (features) to include. While including too few features may result in large 
approximation error (underfitting), including too many may result in large statistical error 
(overfitting). As discussed in Section 2.4, we need to select the features which provide the 
best tradeoff between the approximation and statistical errors, so that the (expected) gener- 
alization risk of the learner is minimized. Depending on how the (expected) generalization 
risk is estimated, there are a number of strategies for feature selection: 


1. Use test data $\tau' = (\mathbf{X}', \mathbf{y}')$ that are obtained independently from the training data $\tau$,
   to estimate the generalization risk $\mathbb{E}\,(Y - g_\tau(\mathbf{X}))^2$ via the test loss (2.7). Then choose
   the collection of features that minimizes the test loss. When there is an abundance of
   data, part of the data can be reserved as test data, while the remaining data is used as
   training data.

2. When there is a limited amount of data, we can use cross-validation to estimate the
   expected generalization risk $\mathbb{E}\,(Y - g_{\mathcal{T}}(\mathbf{X}))^2$ (where $\mathcal{T}$ is a random training set), as
   explained in Section 2.5.2. This is then minimized over the set of possible choices
   for the explanatory variables.

3. When one has to choose between many potential explanatory variables, techniques
   such as regularized least-squares and lasso regression become important. Such
   methods offer another approach to model selection, via the regularization (or homotopy)
   paths. This will be the topic of Section 6.2 in the next chapter.

4. Rather than using computer-intensive techniques, such as the ones above, one can
   use theoretical estimates of the expected generalization risk, such as the in-sample
   risk, AIC, and BIC, as in Section 2.5, and minimize this to determine a good set of
   explanatory variables.

5. None of the above approaches assumes any distributional properties of the error
   terms $\{\varepsilon_i\}$ in the linear model, other than that they are independent with expectation
   0 and variance $\sigma^2$. If, however, they are assumed to have a normal (Gaussian) distribution
   (that is, $\{\varepsilon_i\} \sim_{\text{iid}} \mathcal{N}(0, \sigma^2)$), then the inclusion and exclusion of variables can
   be decided by means of hypothesis tests. This is the classical approach to model
   selection, and will be discussed in Section 5.4. As a consequence of the central limit
   theorem, one can use the same approach when the error terms are not necessarily
   normal, provided their variance is finite and the sample size $n$ is large.

6. Finally, when using a Bayesian approach, comparison of two models can be achieved
   by computing their so-called Bayes factor (see Section 2.9).


All of the above strategies can be thought of as specifications of a simple rule formu- 
lated by William of Occam, which can be interpreted as: 


When presented with competing models, choose the simplest one that explains 
the data. 


This age-old principle, known as Occam’s razor, is mirrored in a famous quote of Einstein: 
Everything should be made as simple as possible, but not simpler. 


In linear regression, the number of parameters or predictors is usually a reasonable measure 
of the simplicity of the model. 


5.3.3 Cross-Validation and Predictive Residual Sum of Squares 


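We start by considering $n$-fold cross-validation, also called leave-one-out cross-validation,
for the linear model (5.8). We partition the data into $n$ data sets, leaving out
precisely one observation per data set, which we then predict based on the $n - 1$ remaining
observations; see Section 2.5.2 for the general case. Let $\widehat{y}_{-i}$ denote the prediction for the
$i$-th observation using all the data except $y_i$. The error in the prediction, $y_i - \widehat{y}_{-i}$, is called a
predicted residual, in contrast to an ordinary residual, $e_i = y_i - \widehat{y}_i$, which is the difference
between an observation and its fitted value $\widehat{y}_i = g_\tau(\mathbf{x}_i)$ obtained using the whole sample. In
this way, we obtain the collection of predicted residuals $\{y_i - \widehat{y}_{-i}\}_{i=1}^n$ and summarize them
through the predicted residual sum of squares (PRESS):

$$\text{PRESS} = \sum_{i=1}^n (y_i - \widehat{y}_{-i})^2.$$

Dividing the PRESS by $n$ gives an estimate of the expected generalization risk.

In general, computing the PRESS is computationally intensive as it involves training
and predicting $n$ separate times. For linear models, however, the predicted residuals can be
calculated quickly using only the ordinary residuals and the projection matrix $\mathbf{P} = \mathbf{X}\mathbf{X}^+$
onto the linear space spanned by the columns of the model matrix $\mathbf{X}$ (see (2.13)). The $i$-th
diagonal element $\mathbf{P}_{ii}$ of the projection matrix is called the $i$-th leverage, and it can be shown
that $0 \leq \mathbf{P}_{ii} \leq 1$ (see Exercise 10).

Theorem 5.1: PRESS for Linear Models
Consider the linear model (5.8) with a full-rank model matrix $\mathbf{X}$, ordinary residuals
$e_i = y_i - \widehat{y}_i$, and leverages $p_i := \mathbf{P}_{ii} < 1$, $i = 1, \ldots, n$. Then

$$\text{PRESS} = \sum_{i=1}^n \left( \frac{e_i}{1 - p_i} \right)^2.$$

Proof: It suffices to show that the $i$-th predicted residual can be written as $y_i - \widehat{y}_{-i} =
e_i/(1 - p_i)$. Let $\mathbf{X}_{-i}$ denote the model matrix $\mathbf{X}$ with the $i$-th row, $\mathbf{x}_i^\top$, removed, and define
$\mathbf{y}_{-i}$ similarly. Then, the least-squares estimate for $\boldsymbol{\beta}$ using all but the $i$-th observation is
$\widehat{\boldsymbol{\beta}}_{-i} = (\mathbf{X}_{-i}^\top \mathbf{X}_{-i})^{-1} \mathbf{X}_{-i}^\top \mathbf{y}_{-i}$. Writing $\mathbf{X}^\top \mathbf{X} = \mathbf{X}_{-i}^\top \mathbf{X}_{-i} + \mathbf{x}_i \mathbf{x}_i^\top$, we have by the Sherman–Morrison
formula

$$(\mathbf{X}_{-i}^\top \mathbf{X}_{-i})^{-1} = (\mathbf{X}^\top \mathbf{X})^{-1} + \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1}}{1 - \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i},$$

where $\mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i = p_i < 1$. Also, $\mathbf{X}_{-i}^\top \mathbf{y}_{-i} = \mathbf{X}^\top \mathbf{y} - \mathbf{x}_i y_i$. Combining all these identities,
we have

$$\begin{aligned}
\widehat{\boldsymbol{\beta}}_{-i} &= (\mathbf{X}_{-i}^\top \mathbf{X}_{-i})^{-1} \mathbf{X}_{-i}^\top \mathbf{y}_{-i} \\
&= \left( (\mathbf{X}^\top \mathbf{X})^{-1} + \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1}}{1 - p_i} \right) (\mathbf{X}^\top \mathbf{y} - \mathbf{x}_i y_i) \\
&= \widehat{\boldsymbol{\beta}} + \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}}{1 - p_i} - (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i y_i - \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i\, p_i\, y_i}{1 - p_i} \\
&= \widehat{\boldsymbol{\beta}} + \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}}{1 - p_i} - \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i y_i}{1 - p_i} \\
&= \widehat{\boldsymbol{\beta}} - \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i (y_i - \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}})}{1 - p_i} = \widehat{\boldsymbol{\beta}} - \frac{(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i e_i}{1 - p_i}.
\end{aligned}$$

It follows that the predicted value for the $i$-th observation is given by

$$\widehat{y}_{-i} = \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}_{-i} = \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}} - \frac{\mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i\, e_i}{1 - p_i} = \widehat{y}_i - \frac{p_i e_i}{1 - p_i}.$$

Hence, $y_i - \widehat{y}_{-i} = e_i + p_i e_i/(1 - p_i) = e_i/(1 - p_i)$. □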
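
The identity $y_i - \widehat{y}_{-i} = e_i/(1 - p_i)$ can be checked numerically by comparing the leverage-based formula with a brute-force leave-one-out loop. The sketch below uses simulated data of our own choosing; the function names `press_shortcut` and `press_brute_force` are hypothetical.

```python
import numpy as np

def press_shortcut(X, y):
    """PRESS from ordinary residuals and leverages (one fit only)."""
    P = X @ np.linalg.pinv(X)              # projection matrix P = X X^+
    e = y - P @ y                          # ordinary residuals
    return float(np.sum((e / (1.0 - np.diag(P))) ** 2))

def press_brute_force(X, y):
    """PRESS by refitting the model n times, each time leaving out one observation."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ b) ** 2    # squared predicted residual
    return float(total)

rng = np.random.default_rng(2)
n = 30
u = rng.random(n)
X = np.column_stack((np.ones(n), u, u ** 2))          # quadratic polynomial features
y = 1 + 2 * u - u ** 2 + 0.1 * rng.standard_normal(n)

print(press_shortcut(X, y), press_brute_force(X, y))  # the two values should agree
```

The shortcut needs a single fit, whereas the brute-force version solves $n$ least-squares problems.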
Example 5.2 (Polynomial Regression (cont.)) We return to Example 2.1, where we
estimated the generalization risk for various polynomial prediction functions using independent
validation data. Instead, let us estimate the expected generalization risk via cross-validation
(thus using only the training set) and apply Theorem 5.1 to compute the PRESS.

polyregpress.py
import numpy as np
import matplotlib.pyplot as plt

def generate_data(beta, sig, n):
    u = np.random.rand(n, 1)
    y = u ** np.arange(0, 4) @ beta.reshape(4,1) + (
        sig * np.random.randn(n, 1))
    return u, y

np.random.seed(12)
beta = np.array([[10.0, -140, 400, -250]]).T
sig = 5; n = 10**2
u, y = generate_data(beta, sig, n)
X = np.ones((n, 1))
K = 12 # maximum number of parameters
press = np.zeros(K+1)
for k in range(1, K):
    if k > 1:
        X = np.hstack((X, u**(k-1))) # add column to matrix
    P = X @ np.linalg.pinv(X) # projection matrix
    e = y - P @ y # ordinary residuals
    press[k] = np.sum((e/(1-np.diag(P).reshape(n,1)))**2)
plt.plot(press[1:K]/n)

The PRESS values divided by n = 100 for the constant, linear, quadratic, cubic, and 
quartic order polynomial regression models are, respectively, 152.487, 56.249, 51.606, 
30.999, and 31.634. Hence, the cubic polynomial regression model has the lowest PRESS, 
indicating that it has the best predictive performance.


5.3.4 In-Sample Risk and Akaike Information Criterion


In Section 2.5.1 we introduced the in-sample risk as a measure for the accuracy of the
prediction function. To recapitulate, given a fixed data set $\tau$ with associated response vector
$\mathbf{y}$ and $n \times p$ matrix of explanatory variables $\mathbf{X}$, the in-sample risk of a prediction function
$g$ is defined as

$$\ell_{\text{in}}(g) := \mathbb{E}_{\mathbf{X}}\, \text{Loss}(Y, g(\mathbf{X})), \tag{5.11}$$

where $\mathbb{E}_{\mathbf{X}}$ signifies that the expectation is taken under a different probability model, in
which $\mathbf{X}$ takes the values $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with equal probability, and given $\mathbf{X} = \mathbf{x}_i$ the random
variable $Y$ is drawn from the conditional pdf $f(y \mid \mathbf{x}_i)$. The difference between the in-sample
risk and the training loss is called the optimism. For the squared-error loss, Theorem 2.2
expresses the expected optimism of a learner $g_{\mathcal{T}}$ as two times the average covariance between
the predicted values and the responses.

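If the conditional variance of the error $Y - g^*(\mathbf{X})$ given $\mathbf{X} = \mathbf{x}$ does not depend on $\mathbf{x}$,
then the expected in-sample risk of a learner $g_{\mathcal{T}}$, averaged over all training sets, has a simple
expression:

Theorem 5.2: Expected In-Sample Risk for Linear Models
Let $\mathbf{X}$ be the $n \times p$ model matrix of a linear model. If the conditional variance
$v^2$ of $Y - g^*(\mathbf{X})$ given $\mathbf{X} = \mathbf{x}$ does not depend on $\mathbf{x}$, then, for the squared-error loss,

$$\mathbb{E}_{\mathbf{X}}\, \ell_{\text{in}}(g_{\mathcal{T}}) = \mathbb{E}_{\mathbf{X}}\, \ell_{\mathcal{T}}(g_{\mathcal{T}}) + \frac{2\, \ell^*\, p}{n}, \tag{5.12}$$

where $\ell^*$ is the irreducible risk.

Proof: The expected optimism is, by definition, $\mathbb{E}_{\mathbf{X}}[\ell_{\text{in}}(g_{\mathcal{T}}) - \ell_{\mathcal{T}}(g_{\mathcal{T}})]$ which, for the
squared-error loss, is equal to $2 \ell^* p/n$, using exactly the same reasoning as in Example 2.3.
Note that here $\ell^* = v^2$. □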


Equation (5.12) is the basis of the following model comparison heuristic: Estimate the
irreducible risk $\ell^* = v^2$ via $\widehat{v^2}$, using a model with relatively high complexity. Then choose
the linear model with the lowest value of

$$\|\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2 + 2\, \widehat{v^2}\, p. \tag{5.13}$$
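
As an illustration of the heuristic (5.13), the sketch below scores polynomial models of increasing order on simulated data. Several choices here are our own assumptions and not prescribed by the text: the simulated cubic data, the use of an 8-parameter polynomial as the "relatively high complexity" model, and the estimate $\widehat{v^2} = \text{RSS}/(n - p)$ computed from that model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
u = rng.random(n)
y = 10 - 140 * u + 400 * u**2 - 250 * u**3 + 5 * rng.standard_normal(n)

def rss(X, y):
    """Residual sum of squares of the least-squares fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

p_max = 8                                       # size of the high-complexity model (assumption)
X_big = np.vander(u, p_max, increasing=True)    # columns 1, u, ..., u^(p_max - 1)
v2_hat = rss(X_big, y) / (n - p_max)            # estimate of the irreducible risk (assumption)

# Criterion (5.13) for polynomial models with p = 1, ..., p_max parameters
scores = {p: rss(np.vander(u, p, increasing=True), y) + 2 * v2_hat * p
          for p in range(1, p_max + 1)}
best_p = min(scores, key=scores.get)
print(best_p, scores[best_p])
```

The penalty term $2\,\widehat{v^2}\,p$ grows linearly in the number of parameters, so a larger model is preferred only if it reduces the residual sum of squares by more than $2\,\widehat{v^2}$ per added parameter.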


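We can also use the Akaike information criterion (AIC) as a heuristic for model comparison.
We discussed the AIC in the unsupervised learning setting in Section 4.2, but the
arguments used there can also be applied to the supervised case, under the in-sample model
for the data. In particular, let $\mathbf{Z} = (\mathbf{X}, Y)$. We wish to predict the joint density

$$f(\mathbf{z}) = f(\mathbf{x}, y) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{\mathbf{x} = \mathbf{x}_i\}\, f_i(y),$$

using a prediction function $g(\mathbf{z} \mid \boldsymbol{\theta})$ from a family $\mathcal{G} := \{g(\mathbf{z} \mid \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \mathbb{R}^q\}$, where

$$g(\mathbf{z} \mid \boldsymbol{\theta}) = g(\mathbf{x}, y \mid \boldsymbol{\theta}) := \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{\mathbf{x} = \mathbf{x}_i\}\, g_i(y \mid \boldsymbol{\theta}).$$

Note that $q$ is the number of parameters (typically larger than $p$ for a linear model with an
$n \times p$ design matrix).

Following Section 4.2, the in-sample cross-entropy risk in this case is

$$r(\boldsymbol{\theta}) := -\mathbb{E}_{\mathbf{X}} \ln g(\mathbf{Z} \mid \boldsymbol{\theta}),$$

and to approximate the optimal parameter $\boldsymbol{\theta}^*$ we minimize the corresponding training loss

$$r_{\tau_n}(\boldsymbol{\theta}) := -\frac{1}{n} \sum_{j=1}^n \ln g(\mathbf{z}_j \mid \boldsymbol{\theta}).$$

The optimal parameter $\widehat{\boldsymbol{\theta}}_n$ for the training loss is thus found by minimizing

$$-\frac{1}{n} \sum_{j=1}^n \bigl( -\ln n + \ln g_j(y_j \mid \boldsymbol{\theta}) \bigr).$$

That is, it is the maximum likelihood estimate of $\boldsymbol{\theta}$:

$$\widehat{\boldsymbol{\theta}}_n = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^n \ln g_i(y_i \mid \boldsymbol{\theta}).$$

Under the assumption that $f = g(\cdot \mid \boldsymbol{\theta}^*)$ for some parameter $\boldsymbol{\theta}^*$, we have from Theorem 4.1
that the estimated in-sample generalization risk can be approximated as

$$\mathbb{E}_{\mathbf{X}}\, r(\widehat{\boldsymbol{\theta}}_n) \approx r_{\mathcal{T}_n}(\widehat{\boldsymbol{\theta}}_n) + \frac{q}{n} = \ln n - \frac{1}{n} \sum_{j=1}^n \ln g_j(y_j \mid \widehat{\boldsymbol{\theta}}_n) + \frac{q}{n}.$$

This leads to the heuristic of selecting the learner $g(\cdot \mid \widehat{\boldsymbol{\theta}}_n)$ with the smallest value of the
AIC:

$$-2 \sum_{i=1}^n \ln g_i(y_i \mid \widehat{\boldsymbol{\theta}}_n) + 2q. \tag{5.14}$$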
i=l 
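Example 5.3 (Normal Linear Model) For the normal linear model $Y \sim \mathcal{N}(\mathbf{x}^\top \boldsymbol{\beta}, \sigma^2)$
(see (2.29)), with a $p$-dimensional vector $\boldsymbol{\beta}$, we have

$$g_i(y_i \mid \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2}{2\sigma^2} \right), \quad i = 1, \ldots, n,$$

so that the AIC is

$$n \ln(2\pi) + n \ln \widehat{\sigma^2} + \frac{\|\mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2}{\widehat{\sigma^2}} + 2q, \tag{5.15}$$

where $(\widehat{\boldsymbol{\beta}}, \widehat{\sigma^2})$ is the maximum likelihood estimate and $q = p + 1$ is the number of parameters
(including $\sigma^2$). For model comparison we may remove the $n \ln(2\pi)$ term if all the models
are normal linear models.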


Certain software packages report the AIC without the $n \ln \widehat{\sigma^2}$ term in (5.15). This
may lead to sub-optimal model selection if normal models are compared with non-normal
ones.
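
The AIC in (5.15) is straightforward to evaluate once the model has been fitted. The following is a minimal sketch; the function name `aic_normal_linear` and the simulated data are our own, and all terms of (5.15), including $n \ln(2\pi)$, are kept.

```python
import numpy as np

def aic_normal_linear(X, y):
    """AIC (5.15) for a normal linear model fitted by maximum likelihood."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    sigma2_hat = rss / n                  # maximum likelihood estimate of sigma^2
    q = p + 1                             # p regression coefficients plus sigma^2
    return n * np.log(2 * np.pi) + n * np.log(sigma2_hat) + rss / sigma2_hat + 2 * q

# Simulated data from a straight line (illustrative values only)
rng = np.random.default_rng(5)
n = 80
u = rng.random(n)
y = 1.0 + 2.0 * u + 0.3 * rng.standard_normal(n)
X_lin = np.column_stack((np.ones(n), u))
X_quad = np.column_stack((np.ones(n), u, u ** 2))

aic_lin, aic_quad = aic_normal_linear(X_lin, y), aic_normal_linear(X_quad, y)
print(aic_lin, aic_quad)                  # the model with the smaller AIC is preferred
```

Since $\widehat{\sigma^2} = \text{RSS}/n$, the third term of (5.15) always equals $n$; the comparison between normal linear models is therefore driven by $n \ln \widehat{\sigma^2} + 2q$.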





5.3.5 Categorical Features 


Suppose that, as described in Chapter 1, the data is given in the form of a spreadsheet or
data frame with $n$ rows and $p + 1$ columns, where the first element of row $i$ is the response
variable $y_i$, and the remaining $p$ elements form the vector of explanatory variables $\mathbf{x}_i^\top$.
When all the explanatory variables (features, predictors) are quantitative, then the model
matrix $\mathbf{X}$ can be directly read off from the data frame as the $n \times p$ matrix with rows $\mathbf{x}_i^\top$, $i =
1, \ldots, n$.

However, when some explanatory variables are qualitative (categorical), such a one-to-one
correspondence between data frame and model matrix no longer holds. The solution is
to include indicator or dummy variables.

Linear models with continuous responses and categorical explanatory variables often 
arise in factorial experiments. These are controlled statistical experiments in which the 
aim is to assess how a response variable is affected by one or more factors tested at several 
levels. A typical example is an agricultural experiment where one wishes to investigate 
how the yield of a food crop depends on factors such as location, pesticide, and fertilizer. 


Example 5.4 (Crop Yield) The data in Table 5.1 list the yield of a food crop for four
different crop treatments (e.g., strengths of fertilizer) on four different blocks (plots). 


Table 5.1: Crop yield for different treatments and blocks.

                     Treatment
Block        1         2         3         4
  1       9.2988    9.4978    9.7604   10.1025
  2       8.2111    8.3387    8.5018    8.1942
  3       9.0688    9.1284    9.3484    9.5086
  4       8.2552    7.8999    8.4859    8.9485


The corresponding data frame, given in Table 5.2, has 16 rows and 3 columns: one 
column for the crop yield (the response variable), one column for the Treatment, with 
levels 1, 2, 3, 4, and one column for the Block, also with levels 1, 2, 3, 4. The values 1, 
2, 3, and 4 have no quantitative meaning (it does not make sense to take their average, for
example); they merely identify the category of the treatment or block.


Table 5.2: Crop yield data organized as a data frame in standard format.

 Yield    Treatment   Block
 9.2988       1         1
 8.2111       1         2
 9.0688       1         3
 8.2552       1         4
 9.4978       2         1
 8.3387       2         2
   ...       ...       ...
 9.5086       4         3
 8.9485       4         4

In general, suppose there are $r$ factor (categorical) variables $u_1, \ldots, u_r$, where the $j$-th
factor has $p_j$ mutually exclusive levels, denoted by $1, \ldots, p_j$. In order to include these
categorical variables in a linear model, a common approach is to introduce an indicator
feature $x_{jk} = \mathbb{1}\{u_j = k\}$ for each factor $j$ at level $k$. Thus, $x_{jk} = 1$ if the value of factor $j$
is $k$ and 0 otherwise. Since $\sum_k \mathbb{1}\{u_j = k\} = 1$, it suffices to consider only $p_j - 1$ of these
indicator features for each factor $j$ (this prevents the model matrix from being rank deficient).
For a single response $Y$, the feature vector $\mathbf{x}^\top$ is thus a row vector of binary variables
that indicates which levels were observed for each factor. The model assumption is that $Y$
depends in a linear way on the indicator features, apart from an error term. That is,

$$Y = \beta_0 + \sum_{j=1}^r \sum_{k=2}^{p_j} \beta_{jk}\, \underbrace{\mathbb{1}\{u_j = k\}}_{x_{jk}} + \varepsilon,$$

where we have omitted one indicator feature (corresponding to level 1) for each factor $j$.
For independent responses $Y_1, \ldots, Y_n$ corresponding to the factors $\{u_{ij},\ j = 1, \ldots, r\}$, $i = 1, \ldots, n$, let
$x_{ijk} = \mathbb{1}\{u_{ij} = k\}$. Then, the linear model for the data becomes

$$Y_i = \beta_0 + \sum_{j=1}^r \sum_{k=2}^{p_j} \beta_{jk}\, x_{ijk} + \varepsilon_i, \tag{5.16}$$

where the $\{\varepsilon_i\}$ are independent with expectation 0 and some variance $\sigma^2$. By gathering the
$\beta_0$ and $\{\beta_{jk}\}$ into a vector $\boldsymbol{\beta}$, and the $\{x_{ijk}\}$ into a matrix $\mathbf{X}$, we have again a linear model of
the form (5.8). The model matrix $\mathbf{X}$ has $n$ rows and $1 + \sum_{j=1}^r (p_j - 1)$ columns. Using the
above convention that the $\beta_{j1}$ parameters are subsumed in the parameter $\beta_0$ (corresponding
to the "constant" feature), we can interpret $\beta_0$ as a baseline response when using the
explanatory vector $\mathbf{x}^\top$ for which $x_{j1} = 1$ for all factors $j = 1, \ldots, r$. The other parameters
$\{\beta_{jk}\}$ can be viewed as incremental effects relative to this baseline effect. For example, $\beta_{12}$
describes by how much the response is expected to change if level 2 is used instead of level
1 for factor 1.


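Example 5.5 (Crop Yield (cont.)) In Example 5.4, the linear model (5.16) has eight
parameters: $\beta_0, \beta_{12}, \beta_{13}, \beta_{14}, \beta_{22}, \beta_{23}, \beta_{24}$, and $\sigma^2$. The model matrix $\mathbf{X}$ depends on how
the crop yields are organized in a vector $\mathbf{y}$ and on the ordering of the factors. Let
us order $\mathbf{y}$ column-wise from Table 5.1, as in $\mathbf{y} = [9.2988, 8.2111, 9.0688, 8.2552,
9.4978, \ldots, 8.9485]^\top$, and let Treatment be Factor 1 and Block be Factor 2. Then we can
write (5.16) as

$$\mathbf{Y} = \underbrace{\begin{bmatrix} \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{C} \\ \mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{C} \\ \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{C} \\ \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{C} \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_{12} \\ \beta_{13} \\ \beta_{14} \\ \beta_{22} \\ \beta_{23} \\ \beta_{24} \end{bmatrix}}_{\boldsymbol{\beta}} + \boldsymbol{\varepsilon}, \quad \text{where } \mathbf{C} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$

and with $\mathbf{1} = [1, 1, 1, 1]^\top$ and $\mathbf{0} = [0, 0, 0, 0]^\top$. Estimation of $\boldsymbol{\beta}$ and $\sigma^2$, model selection,
and prediction can now be carried out in the usual manner for linear models.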
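
The construction of the model matrix from factor levels can be sketched as follows. The helper `indicator_model_matrix` is a hypothetical name of our own; it implements the convention of (5.16), in which level 1 of each factor is absorbed into the constant feature. Applied to the crop design with the column-wise ordering of Example 5.5, it should give a $16 \times 7$ matrix of full column rank.

```python
import numpy as np

def indicator_model_matrix(factors, n_levels):
    """Build the model matrix for (5.16).

    factors  : (n, r) integer array; entry (i, j) is the level (1..p_j) of factor j
    n_levels : list with the number of levels p_j of each factor
    """
    n, r = factors.shape
    cols = [np.ones((n, 1))]                       # constant feature (baseline)
    for j in range(r):
        for k in range(2, n_levels[j] + 1):        # level 1 is omitted
            cols.append((factors[:, j] == k).astype(float).reshape(n, 1))
    return np.hstack(cols)

# Crop design: treatment changes every four observations, block cycles 1, 2, 3, 4
treatment = np.repeat(np.arange(1, 5), 4)
block = np.tile(np.arange(1, 5), 4)
X = indicator_model_matrix(np.column_stack((treatment, block)), [4, 4])
print(X.shape, np.linalg.matrix_rank(X))
```

Including all $p_j$ indicators of a factor together with the constant column would make the columns linearly dependent, which is why one level per factor is dropped.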


In the context of factorial experiments, the model matrix is often called the design 
matrix, as it specifies the design of the experiment; e.g., how many replications are taken 
for each combination of factor levels. The model (5.16) can be extended by adding products 
of indicator variables as new features. Such features are called interaction terms. 




5.3.6 Nested Models 


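Let $\mathbf{X}$ be an $n \times p$ model matrix of the form $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2]$, where $\mathbf{X}_1$ and $\mathbf{X}_2$ are model
matrices of dimension $n \times k$ and $n \times (p - k)$, respectively. The linear models $\mathbf{Y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}$
and $\mathbf{Y} = \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$ are said to be nested within the linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. This simply
means that certain features in $\mathbf{X}$ are ignored in each of the first two models. Note that $\boldsymbol{\beta}$, $\boldsymbol{\beta}_1$,
and $\boldsymbol{\beta}_2$ are parameter vectors of dimension $p$, $k$, and $p - k$, respectively. In what follows,
we assume that $n \geq p$ and that all model matrices are full-rank.

Suppose we wish to assess whether to use the full model matrix $\mathbf{X}$ or the reduced model
matrix $\mathbf{X}_1$. Let $\widehat{\boldsymbol{\beta}}$ be the estimate of $\boldsymbol{\beta}$ under the full model (that is, obtained via (5.9)), and
let $\widehat{\boldsymbol{\beta}}_1$ denote the estimate of $\boldsymbol{\beta}_1$ for the reduced model. Let $\mathbf{Y}^{(2)} = \mathbf{X}\widehat{\boldsymbol{\beta}}$ be the projection of $\mathbf{Y}$
onto the space $\operatorname{Span}(\mathbf{X})$ spanned by the columns of $\mathbf{X}$; and let $\mathbf{Y}^{(1)} = \mathbf{X}_1 \widehat{\boldsymbol{\beta}}_1$ be the projection
of $\mathbf{Y}$ onto the space $\operatorname{Span}(\mathbf{X}_1)$ spanned by the columns of $\mathbf{X}_1$ only; see Figure 5.3. In order
to decide whether the features in $\mathbf{X}_2$ are needed, we may compare the estimated error terms
of the two models, as calculated by (5.10); that is, by the residual sum of squares divided
by the number of observations $n$. If the outcome of this comparison is that there is little
difference between the model error for the full and reduced model, then it is appropriate to
adopt the reduced model, as it has fewer parameters than the full model, while explaining
the data just as well. The comparison is thus between the squared norms $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2$ and
$\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2$. Because of the nested nature of the linear models, $\operatorname{Span}(\mathbf{X}_1)$ is a subspace of
$\operatorname{Span}(\mathbf{X})$ and, consequently, the orthogonal projection of $\mathbf{Y}^{(2)}$ onto $\operatorname{Span}(\mathbf{X}_1)$ is the same
as the orthogonal projection of $\mathbf{Y}$ onto $\operatorname{Span}(\mathbf{X}_1)$; that is, $\mathbf{Y}^{(1)}$. By Pythagoras' theorem, we
thus have the decomposition $\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2 + \|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2 = \|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2$. This is also illustrated
in Figure 5.3.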


Figure 5.3: The residual sum of squares for the full model corresponds to $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2$ and for
the reduced model it is $\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2$. By Pythagoras' theorem, the difference is $\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2$.
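
The Pythagorean decomposition for nested models is easy to confirm numerically. The sketch below uses randomly generated model matrices (our own assumption) and computes the projections with the pseudo-inverse; the helper name `project` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 50, 2, 5
X = np.hstack((np.ones((n, 1)), rng.standard_normal((n, p - 1))))   # full model matrix
X1 = X[:, :k]                                                        # reduced model matrix
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def project(A, v):
    """Orthogonal projection of v onto the column space of A."""
    return A @ (np.linalg.pinv(A) @ v)

y2 = project(X, y)       # fitted values under the full model
y1 = project(X1, y)      # fitted values under the reduced model

lhs = np.sum((y2 - y1) ** 2) + np.sum((y - y2) ** 2)
rhs = np.sum((y - y1) ** 2)
print(lhs, rhs)          # the two numbers should coincide up to rounding
```

Projecting `y2` onto the column space of `X1` gives `y1` again, which is the subspace property used in the argument above.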


The above decomposition can be generalized to more than two model matrices. Suppose
that the model matrix can be decomposed into $d$ submatrices: $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_d]$,
where the matrix $\mathbf{X}_i$ has $p_i$ columns and $n$ rows, $i = 1, \ldots, d$. Thus, the number of columns
(as always, we assume the columns are linearly independent)
in the full model matrix is $p = p_1 + \cdots + p_d$. This creates an increasing sequence of "nested"
model matrices: $\mathbf{X}_1, [\mathbf{X}_1, \mathbf{X}_2], \ldots, [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_d]$, from (say) the baseline normal model
matrix $\mathbf{X}_1 = \mathbf{1}$ to the full model matrix $\mathbf{X}$. Think of each model matrix as corresponding to
specific variables in the model.

We follow a similar projection procedure as in Figure 5.3: First project $\mathbf{Y}$ onto $\operatorname{Span}(\mathbf{X})$
to yield the vector $\mathbf{Y}^{(d)}$, then project $\mathbf{Y}^{(d)}$ onto $\operatorname{Span}([\mathbf{X}_1, \ldots, \mathbf{X}_{d-1}])$ to obtain $\mathbf{Y}^{(d-1)}$, and so
on, until $\mathbf{Y}^{(2)}$ is projected onto $\operatorname{Span}(\mathbf{X}_1)$ to yield $\mathbf{Y}^{(1)} = \overline{Y}\mathbf{1}$ (in the case that $\mathbf{X}_1 = \mathbf{1}$).

By applying Pythagoras' theorem, the total sum of squares can be decomposed as

$$\underbrace{\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2}_{\text{df} = n - p_1} = \underbrace{\|\mathbf{Y} - \mathbf{Y}^{(d)}\|^2}_{\text{df} = n - p} + \underbrace{\|\mathbf{Y}^{(d)} - \mathbf{Y}^{(d-1)}\|^2}_{\text{df} = p_d} + \cdots + \underbrace{\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2}_{\text{df} = p_2}. \tag{5.17}$$

Software packages typically report the sums of squares as well as the corresponding degrees
of freedom (df): $n - p, p_d, \ldots, p_2$.


5.3.7 Coefficient of Determination 


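To assess how a linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ compares to the default model $\mathbf{Y} = \beta_0 \mathbf{1} + \boldsymbol{\varepsilon}$, we
can compare the variance of the original data, estimated via $\sum_i (Y_i - \overline{Y})^2/n = \|\mathbf{Y} - \overline{Y}\mathbf{1}\|^2/n$,
with the variance of the fitted data, estimated via $\sum_i (\widehat{Y}_i - \overline{Y})^2/n = \|\widehat{\mathbf{Y}} - \overline{Y}\mathbf{1}\|^2/n$, where
$\widehat{\mathbf{Y}} = \mathbf{X}\widehat{\boldsymbol{\beta}}$. The sum $\sum_i (Y_i - \overline{Y})^2 = \|\mathbf{Y} - \overline{Y}\mathbf{1}\|^2$ is sometimes called the total sum of squares
(TSS), and the quantity

$$R^2 = \frac{\|\widehat{\mathbf{Y}} - \overline{Y}\mathbf{1}\|^2}{\|\mathbf{Y} - \overline{Y}\mathbf{1}\|^2} \tag{5.18}$$

is called the coefficient of determination of the linear model. In the notation of Figure 5.3,
$\widehat{\mathbf{Y}} = \mathbf{Y}^{(2)}$ and $\overline{Y}\mathbf{1} = \mathbf{Y}^{(1)}$, so that

$$R^2 = \frac{\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2}{\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2} = \frac{\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2 - \|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2}{\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2} = \frac{\text{TSS} - \text{RSS}}{\text{TSS}}.$$

Note that $R^2$ lies between 0 and 1. An $R^2$ value close to 1 indicates that a large proportion
of the variance in the data has been explained by the model.

Many software packages also give the adjusted coefficient of determination, or simply
the adjusted $R^2$, defined by

$$R^2_{\text{adjusted}} = 1 - (1 - R^2)\, \frac{n - 1}{n - p}.$$

The regular $R^2$ is always non-decreasing in the number of parameters (see Exercise 15),
but this may not indicate better predictive power. The adjusted $R^2$ compensates for this
increase by decreasing the regular $R^2$ as the number of variables increases. This heuristic
adjustment can make it easier to compare the quality of two competing models.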
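
Both coefficients can be computed directly from the RSS and TSS. The following is a minimal sketch with a hypothetical helper `r_squared`; it assumes that the model matrix contains a column of ones, so that the default model is nested in the fitted model. The simulated data and the added pure-noise feature are our own illustration.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2); X is assumed to contain a column of ones."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
    r2 = 1.0 - rss / tss                           # (TSS - RSS) / TSS
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return float(r2), float(r2_adj)

rng = np.random.default_rng(6)
n = 40
u = rng.random(n)
y = 2.0 + 3.0 * u + 0.5 * rng.standard_normal(n)
X_small = np.column_stack((np.ones(n), u))
X_large = np.column_stack((X_small, rng.standard_normal(n)))   # adds a pure-noise feature

r2_small, adj_small = r_squared(X_small, y)
r2_large, adj_large = r_squared(X_large, y)
print(r2_small, r2_large, adj_small, adj_large)
```

Adding the noise feature cannot decrease the regular $R^2$, whereas the adjusted version may go down.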


5.4 Inference for Normal Linear Models


So far we have not assumed any distribution for the random vector of errors $\boldsymbol{\varepsilon} =
[\varepsilon_1, \ldots, \varepsilon_n]^\top$ in a linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. When the error terms $\{\varepsilon_i\}$ are assumed to be
normally distributed (that is, $\{\varepsilon_i\} \sim_{\text{iid}} \mathcal{N}(0, \sigma^2)$), whole new avenues open up for inference
on linear models. In Section 2.8 we already saw that for such normal linear models, estimation
of $\boldsymbol{\beta}$ and $\sigma^2$ can be carried out via maximum likelihood methods, yielding the same
estimators from (5.9) and (5.10).

The following theorem lists the properties of these estimators. In particular, it shows
that $\widehat{\boldsymbol{\beta}}$ and $\widehat{\sigma^2}\, n/(n - p)$ are independent and unbiased estimators of $\boldsymbol{\beta}$ and $\sigma^2$, respectively.


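Theorem 5.3: Properties of the Estimators for a Normal Linear Model
Consider the linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$, where $\boldsymbol{\beta}$ is a
$p$-dimensional vector of parameters. Then the following hold.

1. The maximum likelihood estimators $\widehat{\boldsymbol{\beta}}$ and $\widehat{\sigma^2}$ are independent.

2. $\widehat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta},\ \sigma^2 (\mathbf{X}^\top \mathbf{X})^+)$.

3. $n\, \widehat{\sigma^2}/\sigma^2 \sim \chi^2_{n-p}$.

Proof: Using the pseudo-inverse (Definition A.2), we can write the random vector $\widehat{\boldsymbol{\beta}}$ as
$\mathbf{X}^+ \mathbf{Y}$, which is a linear transformation of a normal random vector. Consequently, $\widehat{\boldsymbol{\beta}}$ has a
multivariate normal distribution; see Theorem C.6. The mean vector and covariance matrix
follow from the same theorem:

$$\mathbb{E}\, \widehat{\boldsymbol{\beta}} = \mathbf{X}^+\, \mathbb{E}\, \mathbf{Y} = \mathbf{X}^+ \mathbf{X} \boldsymbol{\beta} = \boldsymbol{\beta}$$

and

$$\operatorname{Cov}(\widehat{\boldsymbol{\beta}}) = \mathbf{X}^+ \sigma^2 \mathbf{I}_n (\mathbf{X}^+)^\top = \sigma^2 (\mathbf{X}^\top \mathbf{X})^+.$$

To show that $\widehat{\boldsymbol{\beta}}$ and $\widehat{\sigma^2}$ are independent, define $\mathbf{Y}^{(2)} = \mathbf{X}\widehat{\boldsymbol{\beta}}$. Note that $\mathbf{Y}/\sigma$ has a $\mathcal{N}(\boldsymbol{\mu}, \mathbf{I}_n)$
distribution, with expectation vector $\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}/\sigma$. A direct application of Theorem C.10
now shows that $(\mathbf{Y} - \mathbf{Y}^{(2)})/\sigma$ is independent of $\mathbf{Y}^{(2)}/\sigma$. Since $\widehat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{X} \widehat{\boldsymbol{\beta}} = \mathbf{X}^+ \mathbf{Y}^{(2)}$ and
$\widehat{\sigma^2} = \|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2/n$, it follows that $\widehat{\sigma^2}$ is independent of $\widehat{\boldsymbol{\beta}}$. Finally, by the same theorem,
the random variable $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2/\sigma^2$ has a $\chi^2_{n-p}$ distribution, as $\mathbf{Y}^{(2)}$ has the same expectation
vector as $\mathbf{Y}$. □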


As a corollary, we see that each estimator $\widehat{\beta}_i$ of $\beta_i$ has a normal distribution with expectation
$\beta_i$ and variance $\sigma^2 \mathbf{u}_i^\top \mathbf{X}^+ (\mathbf{X}^+)^\top \mathbf{u}_i = \sigma^2 \|\mathbf{u}_i^\top \mathbf{X}^+\|^2$, where $\mathbf{u}_i = [0, \ldots, 0, 1, 0, \ldots, 0]^\top$ is
the $i$-th unit vector; in other words, the variance is $\sigma^2 [(\mathbf{X}^\top \mathbf{X})^+]_{ii}$.

It is of interest to test whether certain regression parameters $\beta_i$ are 0 or not, since if
$\beta_i = 0$, the $i$-th explanatory variable has no direct effect on the expected response and so
could be removed from the model. A standard procedure is to conduct a hypothesis test
(see Section C.14 for a review of hypothesis testing) to test the null hypothesis $H_0 : \beta_i = 0$
against the alternative $H_1 : \beta_i \neq 0$, using the test statistic

$$T = \frac{\widehat{\beta}_i / \|\mathbf{u}_i^\top \mathbf{X}^+\|}{\sqrt{\text{RSE}}}, \tag{5.19}$$

where RSE is the residual squared error; that is, $\text{RSE} = \text{RSS}/(n - p)$. This test statistic has
a $t_{n-p}$ distribution under $H_0$. To see this, write $T = Z/\sqrt{V/(n - p)}$, with

$$Z = \frac{\widehat{\beta}_i}{\sigma \|\mathbf{u}_i^\top \mathbf{X}^+\|} \quad \text{and} \quad V = n\, \widehat{\sigma^2}/\sigma^2.$$

Then, by Theorem 5.3, $Z \sim \mathcal{N}(0, 1)$ under $H_0$, $V \sim \chi^2_{n-p}$, and $Z$ and $V$ are independent. The
result now follows directly from Corollary C.1.
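
The test statistics (5.19) for all coefficients can be computed at once from the rows of the pseudo-inverse. The sketch below is illustrative only: the function name `t_statistics` and the simulated data (in which the third coefficient is truly zero) are our own assumptions.

```python
import numpy as np

def t_statistics(X, y):
    """Test statistics (5.19) for H0: beta_i = 0, one per column of X."""
    n, p = X.shape
    X_plus = np.linalg.pinv(X)                      # p x n pseudo-inverse
    beta_hat = X_plus @ y                           # least-squares estimate
    rse = np.sum((y - X @ beta_hat) ** 2) / (n - p) # RSS / (n - p)
    row_norms = np.linalg.norm(X_plus, axis=1)      # ||u_i^T X^+||, i = 1, ..., p
    return beta_hat / (row_norms * np.sqrt(rse))

rng = np.random.default_rng(8)
n = 100
U = rng.random((n, 2))
X = np.column_stack((np.ones(n), U))
y = 1.0 + 3.0 * U[:, 0] + 0.0 * U[:, 1] + 0.1 * rng.standard_normal(n)

T = t_statistics(X, y)
print(T)    # a large |T_i| speaks against H0: beta_i = 0
```

A two-sided P-value for each coefficient can then be obtained from the $t_{n-p}$ distribution, for instance via `2 * scipy.stats.t.sf(abs(T), n - p)` if SciPy is available.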


5.4.1 Comparing Two Normal Linear Models 


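Suppose we have the following normal linear model for data $\mathbf{Y} = [Y_1, \ldots, Y_n]^\top$:

$$\mathbf{Y} = \underbrace{\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2}_{\mathbf{X}\boldsymbol{\beta}} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n), \tag{5.20}$$

where $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are unknown vectors of dimension $k$ and $p - k$, respectively; and $\mathbf{X}_1$
and $\mathbf{X}_2$ are full-rank model matrices of dimensions $n \times k$ and $n \times (p - k)$, respectively.
Above we implicitly defined $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2]$ and $\boldsymbol{\beta}^\top = [\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top]$. Suppose we wish to test the
hypothesis $H_0 : \boldsymbol{\beta}_2 = \mathbf{0}$ against $H_1 : \boldsymbol{\beta}_2 \neq \mathbf{0}$. Following Section 5.3.6, the idea is to compare
the residual sum of squares for both models, expressed as $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2$ and $\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2$. Using
Pythagoras' theorem we saw that $\|\mathbf{Y} - \mathbf{Y}^{(1)}\|^2 - \|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2 = \|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2$, and so it makes
sense to base the decision whether to retain or reject $H_0$ on the quotient of
$\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2$ and $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2$. This leads to the following test statistic.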


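Theorem 5.4: Test Statistic for Comparing Two Normal Linear Models
For the model (5.20), let $\mathbf{Y}^{(2)}$ and $\mathbf{Y}^{(1)}$ be the projections of $\mathbf{Y}$ onto the column spaces of
the $n \times p$ matrix $\mathbf{X}$ and the $n \times k$ matrix $\mathbf{X}_1$, respectively. Then, under $H_0 : \boldsymbol{\beta}_2 = \mathbf{0}$, the
test statistic

$$T = \frac{\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2/(p - k)}{\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2/(n - p)} \tag{5.21}$$

has an $\mathsf{F}(p - k, n - p)$ distribution.

Proof: Define $\widetilde{\mathbf{X}} := \mathbf{Y}/\sigma$ with expectation $\boldsymbol{\mu} := \mathbf{X}\boldsymbol{\beta}/\sigma$, and let $\widetilde{\mathbf{X}}_p := \mathbf{Y}^{(2)}/\sigma$ and
$\widetilde{\mathbf{X}}_k := \mathbf{Y}^{(1)}/\sigma$, with expectations $\boldsymbol{\mu}_p$ and $\boldsymbol{\mu}_k$. Note that $\boldsymbol{\mu}_p = \boldsymbol{\mu}$ and, under $H_0$, $\boldsymbol{\mu}_k = \boldsymbol{\mu}_p$. We can directly apply Theorem C.10
to find that $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2/\sigma^2 = \|\widetilde{\mathbf{X}} - \widetilde{\mathbf{X}}_p\|^2 \sim \chi^2_{n-p}$ and, under $H_0$, $\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2/\sigma^2 = \|\widetilde{\mathbf{X}}_p -
\widetilde{\mathbf{X}}_k\|^2 \sim \chi^2_{p-k}$. Moreover, these random variables are independent of each other. The proof
is completed by applying Theorem C.11. □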


Note that Ho is rejected for large values of T. The testing procedure thus proceeds as 
follows: 


1. Compute the outcome, t say, of the test statistic T in (5.21). 
2. Evaluate the P-value P(T > t), with T ~ F(p — k,n — p). 
3. Reject Ho if this P-value is too small, say less than 0.05. 
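
The three steps above can be sketched in code as follows. The sketch assumes that the test statistic is the ratio of $\|\mathbf{Y}^{(2)} - \mathbf{Y}^{(1)}\|^2/(p - k)$ to $\|\mathbf{Y} - \mathbf{Y}^{(2)}\|^2/(n - p)$, computed here through the two residual sums of squares via Pythagoras' theorem. The function name `nested_f_test` and the simulated data are our own; the P-value is evaluated with `scipy.stats.f.sf`.

```python
import numpy as np
from scipy.stats import f

def rss(X, y):
    """Residual sum of squares of the least-squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

def nested_f_test(X_reduced, X_full, y):
    """Outcome of the F test statistic and its P-value for H0: the extra features are not needed."""
    n, p = X_full.shape
    k = X_reduced.shape[1]
    rss_full, rss_reduced = rss(X_full, y), rss(X_reduced, y)
    # By Pythagoras, rss_reduced - rss_full = ||Y^(2) - Y^(1)||^2
    t = ((rss_reduced - rss_full) / (p - k)) / (rss_full / (n - p))
    p_value = f.sf(t, p - k, n - p)      # P(T > t) with T ~ F(p - k, n - p)
    return t, p_value

rng = np.random.default_rng(7)
n = 60
X1 = np.ones((n, 1))                     # reduced model: constant only
X2 = rng.standard_normal((n, 2))         # two extra features with a real effect
X = np.hstack((X1, X2))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.standard_normal(n)

t_obs, p_val = nested_f_test(X1, X, y)
print(t_obs, p_val)                      # reject H0 if p_val is below, say, 0.05
```

Because the extra features genuinely influence the simulated response, the P-value here is expected to be very small.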


For nested models $[\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k]$, $k = 1, 2, \ldots, d$, as in Section 5.3.6, the F test statistic
in Theorem 5.4 can now be used to test whether certain $\mathbf{X}_i$ are needed or not. In
particular, software packages will report the outcomes of

$$F_i = \frac{\|\mathbf{Y}^{(i)} - \mathbf{Y}^{(i-1)}\|^2/p_i}{\|\mathbf{Y} - \mathbf{Y}^{(d)}\|^2/(n - p)}, \tag{5.22}$$

in the order $i = 2, 3, \ldots, d$. Under the null hypothesis that $\mathbf{Y}^{(i)}$ and $\mathbf{Y}^{(i-1)}$ have the same
expectation (that is, adding $\mathbf{X}_i$ to $\mathbf{X}_{i-1}$ has no additional effect on reducing the approximation
error), the test statistic $F_i$ has an $\mathsf{F}(p_i, n - p)$ distribution, and the corresponding P-values
quantify the strength of the decision to include an additional variable in the model or not.
This procedure is called analysis of variance (ANOVA).

Note that the output of an ANOVA table depends on the order in which the variables
are considered.





Example 5.6 (Crop Yield (cont.)) We continue Examples 5.4 and 5.5. Decompose the
linear model as

$$\mathbf{Y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \mathbf{X}_3 \boldsymbol{\beta}_3 + \boldsymbol{\varepsilon},$$

where $\mathbf{X}_1$ is the column of ones with $\boldsymbol{\beta}_1 = \beta_0$, $\mathbf{X}_2$ contains the treatment indicators with
$\boldsymbol{\beta}_2 = [\beta_{12}, \beta_{13}, \beta_{14}]^\top$, and $\mathbf{X}_3$ contains the block indicators with $\boldsymbol{\beta}_3 = [\beta_{22}, \beta_{23}, \beta_{24}]^\top$.

Is the crop yield dependent on treatment levels as well as blocks? We first test whether we
can remove Block as a factor in the model against it playing a significant role in explaining
the crop yields. Specifically, we test $\boldsymbol{\beta}_3 = \mathbf{0}$ versus $\boldsymbol{\beta}_3 \neq \mathbf{0}$ using Theorem 5.4. Now
the vector $\mathbf{Y}^{(2)}$ is the projection of $\mathbf{Y}$ onto the ($p = 7$)-dimensional space spanned by the
columns of $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3]$; and $\mathbf{Y}^{(1)}$ is the projection of $\mathbf{Y}$ onto the ($k = 4$)-dimensional
space spanned by the columns of $\mathbf{X}_{12} := [\mathbf{X}_1, \mathbf{X}_2]$. The test statistic, $T_{12}$ say, under $H_0$ has
an $\mathsf{F}(3, 9)$ distribution.

The Python code below calculates the outcome of the test statistic $T_{12}$ and the corresponding
P-value. We find $t_{12} = 34.9998$, which gives a P-value of $2.73 \times 10^{-5}$. This shows
that the block effects are extremely important for explaining the data.

Using the extended model (including the block effects), we can test whether $\boldsymbol{\beta}_2 = \mathbf{0}$ or
not; that is, whether the treatments have a significant effect on the crop yield in the presence
of the Block factor. This is done in the last six lines of the code below. The outcome of
the test statistic is 4.4878, with a P-value of 0.0346. By including the block effects, we
effectively reduce the uncertainty in the model and are able to more accurately assess the
effects of the treatments, to conclude that the treatment seems to have an effect on the crop
yield. A closer look at the data shows that within each block (row) the crop yield roughly
increases with the treatment level.


crop.py

import numpy as np
from scipy.stats import f
from numpy.linalg import lstsq, norm

# rows of the listed array are blocks, columns are treatments;
# after the transpose, yy[i, j] is the yield for treatment i in block j
yy = np.array([9.2988, 9.4978, 9.7604, 10.1025,
               8.2111, 8.3387, 8.5018, 8.1942,
               9.0688, 9.1284, 9.3484, 9.5086,
               8.2552, 7.8999, 8.4859, 8.9485]).reshape(4,4).T

nrow, ncol = yy.shape[0], yy.shape[1]
n = nrow * ncol
y = yy.reshape(16,)
X_1 = np.ones((n,1))                    # intercept column

KM = np.kron(np.eye(ncol), np.ones((nrow,1)))
X_2 = KM[:,1:ncol]                      # treatment indicators (levels 2 to 4)
IM = np.eye(nrow)
C = IM[:,1:nrow]
X_3 = np.vstack((C, C))                 # block indicators (levels 2 to 4)
X_3 = np.vstack((X_3, C))
X_3 = np.vstack((X_3, C))

X = np.hstack((X_1, X_2))
X = np.hstack((X, X_3))                 # full model matrix [X_1, X_2, X_3]

p = X.shape[1]                          # number of parameters in full model
betahat = lstsq(X, y, rcond=None)[0]    # estimate under the full model
ym = X @ betahat                        # fitted values under the full model

X_12 = np.hstack((X_1, X_2))            # omitting the block effect
k = X_12.shape[1]                       # number of parameters in reduced model
betahat_12 = lstsq(X_12, y, rcond=None)[0]
y_12 = X_12 @ betahat_12
T_12 = (n-p)/(p-k)*(norm(y-y_12)**2 - norm(y-ym)**2)/norm(y-ym)**2
pval_12 = 1 - f.cdf(T_12, p-k, n-p)

X_13 = np.hstack((X_1, X_3))            # omitting the treatment effect
k = X_13.shape[1]                       # number of parameters in reduced model
betahat_13 = lstsq(X_13, y, rcond=None)[0]
y_13 = X_13 @ betahat_13
T_13 = (n-p)/(p-k)*(norm(y-y_13)**2 - norm(y-ym)**2)/norm(y-ym)**2
pval_13 = 1 - f.cdf(T_13, p-k, n-p)

5.4.2 Confidence and Prediction Intervals


As in all supervised learning settings, linear regression is most useful when we wish to 
predict how a new response variable will behave on the basis of a new explanatory vector 
x. For example, it may be difficult to measure the response variable, but by knowing the 
estimated regression line and the value for x, we will have a reasonably good idea what Y 
or the expected value of Y is going to be. 

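Thus, consider a new $\boldsymbol{x}$ and let $Y \sim \mathcal{N}(\boldsymbol{x}^\top \boldsymbol{\beta}, \sigma^2)$, with $\boldsymbol{\beta}$ and $\sigma^2$ unknown. First we
are going to look at the expected value of $Y$, that is $\mathbb{E}Y = \boldsymbol{x}^\top \boldsymbol{\beta}$. Since $\boldsymbol{\beta}$ is unknown, we
do not know $\mathbb{E}Y$ either. However, we can estimate it via the estimator $\widehat{Y} = \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}$, where
$\widehat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{+})$, by Theorem 5.3. Being linear in the components of $\widehat{\boldsymbol{\beta}}$, $\widehat{Y}$ therefore
has a normal distribution with expectation $\boldsymbol{x}^\top \boldsymbol{\beta}$ and variance $\sigma^2 \|\boldsymbol{x}^\top \mathbf{X}^{+}\|^2$. Let $Z \sim \mathcal{N}(0, 1)$
be the standardized version of $\widehat{Y}$ and $V = \|\boldsymbol{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2/\sigma^2 \sim \chi^2_{n-p}$. Then the random variable

$$T := \frac{(\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}} - \boldsymbol{x}^\top \boldsymbol{\beta}) / \|\boldsymbol{x}^\top \mathbf{X}^{+}\|}{\|\boldsymbol{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\| / \sqrt{n - p}} = \frac{Z}{\sqrt{V/(n - p)}} \tag{5.23}$$

has, by Corollary C.1, a $t_{n-p}$ distribution. After rearranging the identity $\mathbb{P}(|T| \leq t_{n-p;1-\alpha/2}) = 1 - \alpha$,
where $t_{n-p;1-\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the $t_{n-p}$ distribution, we arrive at the
stochastic confidence interval

$$\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}} \pm t_{n-p;1-\alpha/2} \sqrt{\mathrm{RSE}}\, \|\boldsymbol{x}^\top \mathbf{X}^{+}\|, \tag{5.24}$$

where we have identified $\|\boldsymbol{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2/(n - p)$ with RSE. This confidence interval quantifies
the uncertainty in the learner (regression surface).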

A prediction interval for a new response Y is different from a confidence interval for 
EY. Here the idea is to construct an interval such that Y lies in this interval with a certain 
guaranteed probability. Note that now we have two sources of variation: 


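1. $Y \sim \mathcal{N}(\boldsymbol{x}^\top \boldsymbol{\beta}, \sigma^2)$ itself is a random variable.
2. Estimating $\boldsymbol{x}^\top \boldsymbol{\beta}$ via $\widehat{Y}$ brings another source of variation.

We can construct a $(1 - \alpha)$ prediction interval, by finding two random bounds such that
the random variable $Y$ lies between these bounds with probability $1 - \alpha$. We can reason as
follows. Firstly, note that $Y \sim \mathcal{N}(\boldsymbol{x}^\top \boldsymbol{\beta}, \sigma^2)$ and $\widehat{Y} \sim \mathcal{N}(\boldsymbol{x}^\top \boldsymbol{\beta}, \sigma^2 \|\boldsymbol{x}^\top \mathbf{X}^{+}\|^2)$ are independent. It
follows that $Y - \widehat{Y}$ has a normal distribution with expectation 0 and variance

$$\sigma^2 (1 + \|\boldsymbol{x}^\top \mathbf{X}^{+}\|^2). \tag{5.25}$$

Secondly, letting $Z \sim \mathcal{N}(0, 1)$ be the standardized version of $Y - \widehat{Y}$, and repeating the
steps used for the construction of the confidence interval (5.24), we arrive at the prediction
interval

$$\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}} \pm t_{n-p;1-\alpha/2} \sqrt{\mathrm{RSE}}\, \sqrt{1 + \|\boldsymbol{x}^\top \mathbf{X}^{+}\|^2}. \tag{5.26}$$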


This prediction interval captures the uncertainty from an as-yet-unobserved response as 
well as the uncertainty in the parameters of the regression model itself. 
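
As a minimal sketch, the intervals (5.24) and (5.26) can be evaluated with the pseudo-inverse directly. The function name `intervals` and the simulated data are hypothetical, and the model matrix is assumed to have full column rank so that $p$ equals its number of columns.

```python
import numpy as np
from numpy.linalg import lstsq, norm, pinv
from scipy.stats import t


def intervals(X, y, x_new, alpha=0.05):
    """Confidence interval (5.24) and prediction interval (5.26) at x_new."""
    n, p = X.shape                            # assumes full column rank p
    betahat = lstsq(X, y, rcond=None)[0]
    rse = norm(y - X @ betahat)**2 / (n - p)  # estimate of sigma^2
    a = norm(x_new @ pinv(X))                 # ||x^T X^+||
    tq = t.ppf(1 - alpha / 2, n - p)          # (1 - alpha/2) quantile of t_{n-p}
    center = x_new @ betahat
    half_ci = tq * np.sqrt(rse) * a
    half_pi = tq * np.sqrt(rse) * np.sqrt(1 + a**2)
    return (center - half_ci, center + half_ci), (center - half_pi, center + half_pi)


# Hypothetical simple linear regression data.
rng = np.random.default_rng(1)
n = 40
u = np.linspace(0, 1, n)
X = np.column_stack((np.ones(n), u))
y = 6 + 13 * u + 2 * rng.normal(size=n)
x_new = np.array([1.0, 0.5])

ci, pi = intervals(X, y, x_new)
print(ci, pi)
```

For a full-rank model matrix, $\|\boldsymbol{x}^\top \mathbf{X}^{+}\|^2$ coincides with $\boldsymbol{x}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \boldsymbol{x}$, so the prediction interval is always wider than the confidence interval at the same point.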
■ Example 5.7 (Confidence Limits in Simple Linear Regression) The following program
draws $n = 100$ samples from a simple linear regression model with parameters
$\boldsymbol{\beta} = [6, 13]^\top$ and $\sigma = 2$, where the x-coordinates are evenly spaced on the interval [0, 1].
The parameters are estimated in the third block of the code. Estimates for $\boldsymbol{\beta}$ and $\sigma$ are
$[6.03, 13.09]^\top$ and $\widehat{\sigma} = 1.60$, respectively. The program then proceeds by calculating the
95% numeric confidence and prediction intervals for various values of the explanatory
variable. Figure 5.4 shows the results.


confpred.py

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
from numpy.linalg import inv, lstsq, norm
np.random.seed(123)

n = 100
x = np.linspace(0.01,1,100).reshape(n,1)
# parameters
beta = np.array([6,13])
sigma = 2
Xmat = np.hstack((np.ones((n,1)), x))   # design matrix
y = Xmat @ beta + sigma*np.random.randn(n)

# solve the normal equations
betahat = lstsq(Xmat, y, rcond=None)[0]
# estimate for sigma
sqMSE = norm(y - Xmat @ betahat)/np.sqrt(n-2)

tquant = t.ppf(0.975,n-2)   # 0.975 quantile
ucl = np.zeros(n)           # upper conf. limits
lcl = np.zeros(n)           # lower conf. limits
upl = np.zeros(n)           # upper pred. limits
lpl = np.zeros(n)           # lower pred. limits
rl = np.zeros(n)            # (true) regression line
u = 0

for i in range(n):
    u = u + 1/n
    xvec = np.array([1,u])
    sqc = np.sqrt(xvec.T @ inv(Xmat.T @ Xmat) @ xvec)
    sqp = np.sqrt(1 + xvec.T @ inv(Xmat.T @ Xmat) @ xvec)
    rl[i] = xvec.T @ beta
    ucl[i] = xvec.T @ betahat + tquant*sqMSE*sqc
    lcl[i] = xvec.T @ betahat - tquant*sqMSE*sqc
    upl[i] = xvec.T @ betahat + tquant*sqMSE*sqp
    lpl[i] = xvec.T @ betahat - tquant*sqMSE*sqp

plt.plot(x,y,'.')
plt.plot(x,rl,'b')          # true regression line (blue, solid)
plt.plot(x,ucl,'k:')        # confidence curves (dotted)
plt.plot(x,lcl,'k:')
plt.plot(x,upl,'r--')       # prediction curves (red, dashed)
plt.plot(x,lpl,'r--')

Figure 5.4: The true regression line (blue, solid) and the upper and lower 95% prediction
curves (red, dashed) and confidence curves (dotted). □


5.5 Nonlinear Regression Models 


So far we have been mostly dealing with linear regression models, in which the prediction
function is of the form $g(\boldsymbol{x} \mid \boldsymbol{\beta}) = \boldsymbol{x}^\top \boldsymbol{\beta}$. In this section we discuss some strategies for
handling general prediction functions $g(\boldsymbol{x} \mid \boldsymbol{\beta})$, where the functional form is known up to an
unknown parameter vector $\boldsymbol{\beta}$. So the regression model becomes

$$Y_i = g(\boldsymbol{x}_i \mid \boldsymbol{\beta}) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{5.27}$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent with expectation 0 and unknown variance $\sigma^2$. The model
can be further specified by assuming that the error terms have a normal distribution.

Table 5.3 gives some common examples of nonlinear prediction functions for data taking
values in $\mathbb{R}$.


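Table 5.3: Common nonlinear prediction functions for one-dimensional data.

Name          $g(x \mid \boldsymbol{\beta})$            $\boldsymbol{\beta}$
Exponential   $a\, e^{bx}$                              $a, b$
Power law     $a\, x^{b}$                               $a, b$
Logistic      $(1 + e^{a + bx})^{-1}$                   $a, b$
Weibull       $1 - \exp(-x^{b}/a)$                      $a, b$
Polynomial    $\sum_{k=0}^{p-1} \beta_k x^k$            $p, \{\beta_k\}_{k=0}^{p-1}$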


The logistic and polynomial prediction functions in Table 5.3 can be readily generalized
to higher dimensions. For example, for $\boldsymbol{x} \in \mathbb{R}^2$ a general second-order polynomial
prediction function is of the form

$$g(\boldsymbol{x} \mid \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2. \tag{5.28}$$

This function can be viewed as a second-order approximation to a general smooth prediction
function $g(x_1, x_2)$; see also Exercise 4. Polynomial regression models are also called
response surface models. The generalization of the above logistic prediction to $\mathbb{R}^d$ is

$$g(\boldsymbol{x} \mid \boldsymbol{\beta}) = (1 + e^{-\boldsymbol{x}^\top \boldsymbol{\beta}})^{-1}. \tag{5.29}$$

This function will make its appearance in Section 5.7 and later on in Chapters 7 and 9.

The first strategy for performing regression with nonlinear prediction functions is to
extend the feature space to obtain a simpler (ideally linear) prediction function in the extended
feature space. We already saw an application of this strategy in Example 2.1 for
the polynomial regression model, where the original feature $u$ was extended to the feature
vector $\boldsymbol{x} = [1, u, u^2, \ldots, u^{p-1}]^\top$, yielding a linear prediction function. In a similar way, the
right-hand side of the polynomial prediction function in (5.28) can be viewed as a linear
function of the extended feature vector $\boldsymbol{\phi}(\boldsymbol{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^\top$. The function $\boldsymbol{\phi}$ is
called a feature map.
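
To make the first strategy concrete, here is a small sketch of the feature map for (5.28) followed by ordinary least squares in the extended feature space. The function name `phi`, the coefficient values, and the noise-free data are all hypothetical and serve only to show that the quadratic surface becomes a linear model after the extension.

```python
import numpy as np
from numpy.linalg import lstsq


def phi(x):
    """Feature map for the second-order polynomial (5.28), x in R^2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])


# Hypothetical noise-free data generated from a known quadratic surface.
beta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.25])
rng = np.random.default_rng(2)
points = rng.uniform(-1, 1, size=(30, 2))
Phi = np.array([phi(x) for x in points])     # extended feature matrix (30 x 6)
y = Phi @ beta_true

# Ordinary least squares in the extended feature space.
beta_hat = lstsq(Phi, y, rcond=None)[0]
print(np.round(beta_hat, 4))
```

Because the data are noise-free and the extended model matrix has full column rank, the least-squares solution recovers the coefficients used to generate the responses.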

The second strategy is to transform the response variable $y$ and possibly also the explanatory
variable $x$ such that the transformed variables $\tilde{y}$, $\tilde{x}$ are related in a simpler (ideally
linear) way. For example, for the exponential prediction function $y = a\, e^{-bx}$, we have
$\ln y = \ln a - bx$, which is a linear relation between $\ln y$ and $[1, x]^\top$.


■ Example 5.8 (Chlorine) Table 5.4 lists the free chlorine concentration (in mg per liter)
in a swimming pool, recorded every 8 hours for 4 days. A simple chemistry-based model
for the chlorine concentration $y$ as a function of time $t$ is $y = a\, e^{-bt}$, where $a$ is the initial
concentration and $b > 0$ is the reaction rate.


Table 5.4: Chlorine concentration (in mg/L) as a function of time (hours).

Hours   Concentration     Hours   Concentration
  0        1.0056           56       0.3293
  8        0.8497           64       0.2617
 16        0.6682           72       0.2460
 24        0.6056           80       0.1839
 32        0.4735           88       0.1867
 40        0.4745           96       0.1688
 48        0.3563


The exponential relationship $y = a\, e^{-bt}$ suggests that a log transformation of $y$ will result
in a linear relationship between $\ln y$ and the feature vector $[1, t]^\top$. Thus, if for some given
data $(t_1, y_1), \ldots, (t_n, y_n)$, we plot $(t_1, \ln y_1), \ldots, (t_n, \ln y_n)$, these points should approximately
lie on a straight line, and hence the simple linear regression model applies. The left panel of
Figure 5.5 illustrates that the transformed data indeed lie approximately on a straight line.
The estimated regression line is also drawn here. The intercept and slope are $\widehat{\beta}_0 = -0.0555$
and $\widehat{\beta}_1 = -0.0190$ here. The original (non-transformed) data is shown in the right panel
of Figure 5.5, along with the fitted curve $y = \widehat{a}\, e^{-\widehat{b} t}$, where $\widehat{a} = \exp(\widehat{\beta}_0) = 0.9461$ and
$\widehat{b} = -\widehat{\beta}_1 = 0.0190$.


Figure 5.5: The chlorine concentration seems to have an exponential decay. □
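
The log-linear fit of Example 5.8 can be reproduced with a short sketch that uses only the data of Table 5.4. The variable names are illustrative and plotting is omitted; the estimates should agree with the values reported in the example up to rounding.

```python
import numpy as np

# Data from Table 5.4: time (hours) and chlorine concentration (mg/L).
t = np.arange(0, 97, 8)
y = np.array([1.0056, 0.8497, 0.6682, 0.6056, 0.4735, 0.4745, 0.3563,
              0.3293, 0.2617, 0.2460, 0.1839, 0.1867, 0.1688])

# Simple linear regression of ln(y) on the feature vector [1, t].
X = np.column_stack((np.ones(len(t)), t))
beta0, beta1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

a_hat = np.exp(beta0)    # estimated initial concentration
b_hat = -beta1           # estimated reaction rate
print(beta0, beta1, a_hat, b_hat)
```

Note that least squares on the log scale minimizes the squared error of $\ln y$, not of $y$ itself, so the fitted curve is in general not the least-squares fit on the original scale.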


Recall that for a general regression problem the learner $g_\tau(\boldsymbol{x})$ for a given training set $\tau$
is obtained by minimizing the training (squared-error) loss

$$\ell_\tau(g(\cdot \mid \boldsymbol{\beta})) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - g(\boldsymbol{x}_i \mid \boldsymbol{\beta})\bigr)^2. \tag{5.30}$$

The third strategy for regression with nonlinear prediction functions is to directly minimize
(5.30) by any means possible, as illustrated in the next example.


■ Example 5.9 (Hougen Function) In [7] the reaction rate $y$ of a certain chemical reaction
is posited to depend on three input variables: quantities of hydrogen $x_1$, n-pentane $x_2$,
and isopentane $x_3$. The functional relationship is given by the Hougen function:

$$y = \frac{\beta_1 x_2 - x_3/\beta_5}{1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3},$$

where $\beta_1, \ldots, \beta_5$ are the unknown parameters. The objective is to estimate the model
parameters $\{\beta_i\}$ from the data, as given in Table 5.5.


Table 5.5: Data for the Hougen function.

 x1    x2    x3      y        x1    x2    x3      y
470   300    10    8.55      470   190    65    4.35
285    80    10    3.79      100   300    54   13.00
470   300   120    4.82      100   300   120    8.50
470    80   120    0.02      100    80   120    0.05
470    80    10    2.75      285   300    10   11.32
100   190    10   14.39      285   190   120    3.13
100    80    65    2.54


The estimation is carried out via the least-squares method. The objective function to
minimize is thus

$$\ell_\tau(g(\cdot \mid \boldsymbol{\beta})) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{\beta_1 x_{i2} - x_{i3}/\beta_5}{1 + \beta_2 x_{i1} + \beta_3 x_{i2} + \beta_4 x_{i3}} \right)^2, \tag{5.31}$$

where the $\{y_i\}$ and $\{x_{ij}\}$ are given in Table 5.5.

This is a highly nonlinear optimization problem, for which standard nonlinear least-squares
methods do not work well. Instead, one can use global optimization methods such
as CE and SCO (see Sections 3.4.2 and 3.4.3). Using the CE method, we found the minimal
value 0.02299 for the objective function, which is attained at

$$\widehat{\boldsymbol{\beta}} = [1.2526, 0.0628, 0.0400, 0.1124, 1.1914]^\top.$$
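
As a sanity check rather than a fitting routine, the sketch below evaluates the objective (5.31) at the reported estimate using the data of Table 5.5. The function names `hougen` and `training_loss` are illustrative, and no optimizer (CE or otherwise) is run here. Because the estimate is rounded to four decimals, the computed loss should be close to, but need not exactly match, the reported minimum.

```python
import numpy as np

# Data from Table 5.5: columns x1 (hydrogen), x2 (n-pentane), x3 (isopentane).
x = np.array([[470, 300, 10], [285, 80, 10], [470, 300, 120], [470, 80, 120],
              [470, 80, 10], [100, 190, 10], [100, 80, 65], [470, 190, 65],
              [100, 300, 54], [100, 300, 120], [100, 80, 120], [285, 300, 10],
              [285, 190, 120]], dtype=float)
y = np.array([8.55, 3.79, 4.82, 0.02, 2.75, 14.39, 2.54,
              4.35, 13.00, 8.50, 0.05, 11.32, 3.13])


def hougen(x, beta):
    """Hougen prediction function for all rows of x."""
    b1, b2, b3, b4, b5 = beta
    return (b1 * x[:, 1] - x[:, 2] / b5) / (
        1 + b2 * x[:, 0] + b3 * x[:, 1] + b4 * x[:, 2])


def training_loss(beta):
    """Average squared error, as in (5.31)."""
    return np.mean((y - hougen(x, beta))**2)


beta_hat = np.array([1.2526, 0.0628, 0.0400, 0.1124, 1.1914])
loss = training_loss(beta_hat)
print(loss)
```

The same `training_loss` function could be handed to any global optimizer as the objective to minimize.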


5.6 Linear Models in Python 


In this section we describe how to define and analyze linear models using Python and the
data science module statsmodels. We encourage the reader to regularly refer back to
the theory in the preceding sections of this chapter, so as to avoid using Python merely
as a black box without understanding the underlying principles. To run the code, start with
the following imports:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols



5.6.1 Modeling 


Although specifying a normal³ linear model in Python is relatively easy, it requires some
subtlety. The main thing to realize is that Python treats quantitative and qualitative (that
is, categorical) explanatory variables differently. In statsmodels, ordinary least-squares
linear models are specified via the function ols (short for ordinary least-squares). The
main argument of this function is a formula of the form

y ~ x1 + x2 + ··· + xd,   (5.32)

where y is the name of the response variable and x1, ..., xd are the names of the explanatory
variables. If all variables are quantitative, this describes the linear model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id} + \varepsilon_i, \quad i = 1, \ldots, n, \tag{5.33}$$

where $x_{ij}$ is the $j$-th explanatory variable for the $i$-th observation and the errors $\varepsilon_i$ are
independent normal random variables such that $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{V}\mathrm{ar}\,\varepsilon_i = \sigma^2$. Or, in matrix
form: $\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with

$$\boldsymbol{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix}, \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_d \end{bmatrix}, \quad \text{and} \quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$

³ For the rest of this section, we assume all linear models to be normal.


Thus, the first column is always taken as an “intercept” parameter, unless otherwise specified.
To remove the intercept term, add -1 to the ols formula, as in ols('y~x-1').
For any linear model, the model matrix can be retrieved via the construction:

model_matrix = pd.DataFrame(model.exog, columns=model.exog_names)

Let us look at some examples of linear models. In the first model the variables x1 and x2
are both considered (by Python) to be quantitative.


myData = pd.DataFrame({'y' : [10,9,4,2,4,9],
                       'x1' : [7.4,1.2,3.1,4.8,2.8,6.5],
                       'x2' : [1,1,2,2,3,3]})
mod = ols("y~x1+x2", data=myData)
mod_matrix = pd.DataFrame(mod.exog, columns=mod.exog_names)
print(mod_matrix)

   Intercept   x1   x2
0        1.0  7.4  1.0
1        1.0  1.2  1.0
2        1.0  3.1  2.0
3        1.0  4.8  2.0
4        1.0  2.8  3.0
5        1.0  6.5  3.0

Suppose the second variable is actually qualitative; e.g., it represents a color, and the 
levels 1, 2, and 3 stand for red, blue, and green. We can account for such a categorical 
variable by using the astype method to redefine the data type (see Section 1.2). 


myData['x2'] = myData['x2'].astype('category') 


Alternatively, a categorical variable can be specified in the model formula by wrapping
it with C(). Observe how this changes the model matrix.

mod2 = ols("y~x1+C(x2)", data=myData)
mod2_matrix = pd.DataFrame(mod2.exog, columns=mod2.exog_names)
print(mod2_matrix)

   Intercept  C(x2)[T.2]  C(x2)[T.3]   x1
0        1.0         0.0         0.0  7.4
1        1.0         0.0         0.0  1.2
2        1.0         1.0         0.0  3.1
3        1.0         1.0         0.0  4.8
4        1.0         0.0         1.0  2.8
5        1.0         0.0         1.0  6.5

Thus, if a statsmodels formula of the form (5.32) contains factor (qualitative) variables,
the model is no longer of the form (5.33), but contains indicator variables for each level of
the factor variable, except the first level.

For the case above, the corresponding linear model is

$$Y_i = \beta_0 + \beta_1 x_{i1} + \alpha_2 \mathbb{1}\{x_{i2} = 2\} + \alpha_3 \mathbb{1}\{x_{i2} = 3\} + \varepsilon_i, \quad i = 1, \ldots, 6, \tag{5.34}$$

where we have used parameters $\alpha_2$ and $\alpha_3$ to correspond to the indicator features of the
qualitative variable. The parameter $\alpha_2$ describes how much the response is expected to
change if the factor $x_2$ switches from level 1 to 2. A similar interpretation holds for $\alpha_3$.
Such parameters can thus be viewed as incremental effects.
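
The indicator coding in (5.34) can also be built by hand, which makes explicit what the formula interface does for a factor. The sketch below uses plain NumPy on hypothetical data (one quantitative feature and one three-level factor); the variable names are illustrative and no statsmodels call is involved.

```python
import numpy as np

# Hypothetical data: quantitative feature x1 and a factor x2 with levels 1, 2, 3.
x1 = np.array([7.4, 1.2, 3.1, 4.8, 2.8, 6.5])
x2 = np.array([1, 1, 2, 2, 3, 3])
levels = np.unique(x2)

# One indicator column per level, except the first (reference) level.
indicators = np.column_stack([(x2 == lev).astype(float) for lev in levels[1:]])

# Columns: intercept, 1{x2 = 2}, 1{x2 = 3}, x1
model_matrix = np.column_stack((np.ones(len(x1)), indicators, x1))
print(model_matrix)
```

Dropping the indicator of the first level keeps the matrix of full column rank: with all three indicators present, their sum would equal the intercept column.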

It is also possible to model interaction between two variables. For two continuous
variables, this simply adds the products of the original features to the model matrix. Adding
interaction terms in Python is achieved by replacing “+” in the formula with “*”, as the
following example illustrates.


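mod3 = ols("y~x1*C(x2)", data=myData)
mod3_matrix = pd.DataFrame(mod3.exog, columns=mod3.exog_names)
print(mod3_matrix)

   Intercept  C(x2)[T.2]  C(x2)[T.3]   x1  x1:C(x2)[T.2]  x1:C(x2)[T.3]
0        1.0         0.0         0.0  7.4            0.0            0.0
1        1.0         0.0         0.0  1.2            0.0            0.0
2        1.0         1.0         0.0  3.1            3.1            0.0
3        1.0         1.0         0.0  4.8            4.8            0.0
4        1.0         0.0         1.0  2.8            0.0            2.8
5        1.0         0.0         1.0  6.5            0.0            6.5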


5.6.2 Analysis 


Let us consider some easy linear regression models by using the student survey data set
survey.csv from the book’s GitHub site, which contains measurements such as height,
weight, sex, etc., from a survey conducted among n = 100 university students. Suppose we
wish to investigate the relation between the shoe size (explanatory variable) and the height
(response variable) of a person. First, we load the data and draw a scatterplot of the points
(height versus shoe size); see Figure 5.6 (without the fitted line).


survey = pd.read_csv('survey.csv')
plt.scatter(survey.shoe, survey.height)
plt.xlabel("Shoe size")
plt.ylabel("Height")

Figure 5.6: Scatterplot of height (cm) against shoe size (cm), with the fitted line.

We observe a slight increase in the height as the shoe size increases, although this
relationship is not very distinct. We analyze the data through the simple linear regression
model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \ldots, n$. In statsmodels this is performed via the ols
method as follows:


model = ols("height~shoe", data=survey)   # define the model
fit = model.fit()                         # fit the model defined above
b0, b1 = fit.params
print(fit.params)

Intercept    145.777570
shoe           1.004803
dtype: float64

The above output gives the least-squares estimates of $\beta_0$ and $\beta_1$. For this example, we
have $\widehat{\beta}_0 = 145.778$ and $\widehat{\beta}_1 = 1.005$. Figure 5.6, which includes the regression line, was
obtained as follows:

plt.plot(survey.shoe, b0 + b1*survey.shoe)
plt.scatter(survey.shoe, survey.height)
plt.xlabel("Shoe size")
plt.ylabel("Height")





Although ols performs a complete analysis of the linear model, not all its calculations 
need to be presented. A summary of the results can be obtained with the method summary. 


print(fit.summary())

Dep. Variable:      height           R-squared:            0.178
Model:              OLS              Adj. R-squared:       0.170
Method:             Least Squares    F-statistic:          21.28
No. Observations:   100              Prob (F-statistic):   1.20e-05
Df Residuals:       98               Log-Likelihood:       -363.88
Df Model:           1                AIC:                  731.8
Covariance Type:    nonrobust

               coef    std err      t     P>|t|    [0.025    0.975]
Intercept  145.7776      5.763    ...      ...    134.341   157.214
shoe         1.0048      0.218    ...      ...        ...       ...

Omnibus:           ...    Durbin-Watson:       ...
Prob(Omnibus):     ...    Jarque-Bera (JB):    ...
Skew:              ...    Prob(JB):            ...
Kurtosis:          ...    Cond. No.            ...

The main output items are the following: 


• coef: Estimates of the parameters of the regression line.

• std err: Standard deviations of the estimators of the regression line. These are
the square roots of the variances of the $\{\widehat{\beta}_i\}$ obtained in (5.25).
• t: Realization of Student’s test statistics associated with the hypotheses $H_0 : \beta_i = 0$
and $H_1 : \beta_i \neq 0$, $i = 0, 1$. In particular, the outcome of $T$ in (5.19).
• P>|t|: P-value of Student’s test (two-sided test).
• [0.025 0.975]: 95% confidence intervals for the parameters.
• R-squared: Coefficient of determination $R^2$ (percentage of variation explained by
the regression), as defined in (5.18).
• Adj. R-squared: adjusted $R^2$ (explained in Section 5.3.7).
• F-statistic: Realization of the F test statistic (5.21) associated with testing the
full model against the default model. The associated degrees of freedom (Df Model
= 1 and Df Residuals = n − 2) are given, as is the P-value: Prob (F-statistic).
• AIC: The AIC number in (5.15); that is, minus two times the log-likelihood plus two
times the number of model parameters. Note that the reported value counts only the two
regression coefficients, since $2 \cdot 363.88 + 2 \cdot 2 \approx 731.8$; counting $\sigma^2$ as a third
parameter would add 2 to this value.


You can access all the numerical values as they are attributes of the fit object. First 
check which names are available, as in: 


dir(fit)

Then access the values via the dot construction. For example, the following extracts the
P-value for the slope.

fit.pvalues['shoe']

1.1994e-05


The results show strong evidence for a linear relationship between shoe size and height
(or, more accurately, strong evidence that the slope of the regression line is not zero), as
the P-value for the corresponding test is very small ($1.2 \cdot 10^{-5}$). The estimate of the slope
indicates that the difference between the average height of students whose shoe size is
different by one cm is 1.0048 cm.

Only 17.84% of the variability of student height is explained by the shoe size. We 
therefore need to add other explanatory variables to the model (multiple linear regression) 
to increase the model’s predictive power. 


5.6.3 Analysis of Variance (ANOVA) 


We continue the student survey example of the previous section, but now add an extra
variable, and also consider an analysis of variance of the model. Instead of “explaining”
the student height via their shoe size alone, we include weight as an additional explanatory
variable. The corresponding ols formula for this model is

height~shoe + weight,

meaning that each random height, denoted by Height, satisfies

$$\text{Height} = \beta_0 + \beta_1\, \text{shoe} + \beta_2\, \text{weight} + \varepsilon,$$

where $\varepsilon$ is a normally distributed error term with mean 0 and variance $\sigma^2$. Thus, the model
has 4 parameters. Before analyzing the model we present a scatterplot of all pairs of variables,
using scatter_matrix.


model = ols("height~shoe+weight", data=survey)
fit = model.fit()
axes = pd.plotting.scatter_matrix(
    survey[['height','shoe','weight']])
plt.show()

Figure 5.7: Scatterplot of all pairs of variables: height (cm), shoe (cm), and weight (kg).


As for the simple linear regression model in the previous section, we can analyze the
model using the summary method (below we have omitted some output):

fit.summary()

Dep. Variable:      height           R-squared:            ...
Model:              OLS              Adj. R-squared:       ...
Method:             Least Squares    F-statistic:          ...
No. Observations:   ...              Prob (F-statistic):   ...
Df Residuals:       ...              Log-Likelihood:       ...
Df Model:           ...

               coef    ...    [0.025    0.975]
Intercept  132.2677    ...   121.853   142.682
shoe         0.5304    ...     0.141     0.920
weight       0.3744    ...     0.261     0.488

The F-statistic is used to test whether the full model (here with two explanatory
variables) is better at “explaining” the height than the default model. The corresponding
null hypothesis is $H_0 : \beta_1 = \beta_2 = 0$. The assertion of interest is $H_1$: at least one of the
coefficients $\beta_j$ ($j = 1, 2$) is significantly different from zero. Given the result of this test (P-value
= $1.429 \cdot 10^{-12}$), we can conclude that at least one of the explanatory variables is associated
with height. The individual Student tests indicate that:

• shoe size is linearly associated with student height, after adjusting for weight, with
P-value 0.0081. At the same weight, an increase of one cm in shoe size corresponds
to an increase of 0.53 cm in average student height;

• weight is linearly associated with student height, after adjusting for shoe size (the
P-value is actually $2.82 \cdot 10^{-9}$; the reported value of 0.000 should be read as “less
than 0.001”). At the same shoe size, an increase of one kg in weight corresponds to
an increase of 0.3744 cm in average student height.

Further understanding is extracted from the model by conducting an analysis of variance.
The standard statsmodels function is anova_lm. The input to this function is the
fit object (obtained from model.fit()) and the output is a DataFrame object.


table = sm.stats.anova_lm(fit)
print(table)

            df       sum_sq      mean_sq          F        PR(>F)
shoe       1.0  1840.467359  1840.467359  30.371310  2.938651e-07
weight     1.0  2596.275747  2596.275747  42.843626  2.816065e-09
Residual  97.0  5878.091294    60.598879        NaN           NaN


The meaning of the columns is as follows. 


• df: The degrees of freedom of the variables, according to the sum of squares decomposition
(5.17). As both shoe and weight are quantitative variables, their degrees
of freedom are both 1 (each corresponding to a single column in the overall model
matrix). The degrees of freedom for the residuals is $n - p = 100 - 3 = 97$.

• sum_sq: The sum of squares according to (5.17). The total sum of squares is the
sum of all the entries in this column. The residual error in the model that cannot be
explained by the variables is RSS ≈ 5878.

• mean_sq: The sum of squares divided by their degrees of freedom. Note that the
residual mean square error RSE = RSS/(n − p) = 60.6 is an unbiased estimate of the
model variance $\sigma^2$; see Section 5.4.

• F: These are the outcomes of the test statistic (5.22).

• PR(>F): These are the P-values corresponding to the test statistic in the preceding
column and are computed using an F distribution whose degrees of freedom are
given in the df column.
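
The relations between the columns can be checked by hand. The sketch below starts only from the df and sum_sq entries of the ANOVA table for the model height~shoe+weight and recomputes the remaining columns; the variable names are illustrative and the survey data itself is not needed.

```python
import numpy as np
from scipy.stats import f

# df and sum_sq columns of the ANOVA table (rows: shoe, weight, Residual).
df = np.array([1.0, 1.0, 97.0])
sum_sq = np.array([1840.467359, 2596.275747, 5878.091294])

mean_sq = sum_sq / df                  # mean_sq column
F = mean_sq[:2] / mean_sq[2]           # outcomes of the test statistic (5.22)
pvals = f.sf(F, df[:2], df[2])         # P-values from the F(df_i, 97) distribution
r2 = 1 - sum_sq[2] / sum_sq.sum()      # coefficient of determination
print(mean_sq, F, pvals, r2)
```

The last line also shows how the coefficient of determination follows from the same column: one minus the residual sum of squares over the total sum of squares.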





The ANOVA table indicates that the shoe variable explains a reasonable amount of the
variation in the model, as evidenced by a sum of squares contribution of 1840 out of
1840 + 2596 + 5878 = 10314 and a very small P-value. After shoe is included in the model, it turns
out that the weight variable explains even more of the remaining variability, with an even
smaller P-value. The remaining sum of squares (5878) is 57% of the total sum of squares,
yielding a 43% reduction, in accordance with the $R^2$ value reported in the summary for the
ols method. As mentioned in Section 5.4.1, the order in which the ANOVA is conducted
is important. To illustrate this, consider the output of the following commands.


model = ols("height~weight+shoe", data=survey)
fit = model.fit()
table = sm.stats.anova_lm(fit)
print(table)

            df       sum_sq      mean_sq          F        PR(>F)
weight     1.0  3993.860167  3993.860167  65.906502  1.503553e-12
shoe       1.0   442.882938   442.882938   7.308434  8.104688e-03
Residual  97.0  5878.091294    60.598879        NaN           NaN





We see that weight as a single model variable explains much more of the variability 
than shoe did. If we now also include shoe, we only obtain a small (but according to the 
P-value still significant) reduction in the model variability. 


5.6.4 Confidence and Prediction Intervals 


In statsmodels a method for computing confidence or prediction intervals from a dic- 
tionary of explanatory variables is get_prediction. It simply executes formula (5.24) or 
(5.26). A simpler version is predict, which only returns the predicted value. 

Continuing the student survey example, suppose we wish to predict the height of a 
person with shoe size 30 cm and weight 75 kg. Confidence and prediction intervals can 
be obtained as given in the code below. The new explanatory variable is entered as a dic- 
tionary. Notice that the 95% prediction interval (for the corresponding random response) is 
much wider than the 95% confidence interval (for the expectation of the random response). 


x = {'shoe': [30.0], 'weight': [75.0]}   # new input (dictionary)
pred = fit.get_prediction(x)
pred.summary_frame(alpha=0.05).unstack()

mean           0    176.261722    # predicted value
mean_se        0      1.054015
mean_ci_lower  0    174.169795    # lower bound for CI
mean_ci_upper  0    178.353650    # upper bound for CI
obs_ci_lower   0    160.670610    # lower bound for PI
obs_ci_upper   0    191.852835    # upper bound for PI
dtype: float64





5.6.5 Model Validation 


We can perform an analysis of residuals to examine whether the underlying assumptions
of the (normal) linear regression model are satisfied. Various plots of the residuals can be
used to inspect whether the assumptions on the errors $\{\varepsilon_i\}$ hold. Figure 5.8 gives two
such plots. The first is a scatterplot of the residuals $\{e_i\}$ against the fitted values $\widehat{y}_i$. When the
model assumptions are valid, the residuals, as approximations of the model error, should
behave approximately as iid normal random variables for each of the fitted values, with a
constant variance. In this case we see no strong aberrant structure in this plot. The residuals
are fairly evenly spread and symmetrical about the y = 0 line (not shown). The second plot
is a quantile-quantile (or qq) plot. This is a useful way to check for normality of the error
terms, by plotting the sample quantiles of the residuals against the theoretical quantiles
of the standard normal distribution. Under the model assumptions, the points should lie
approximately on a straight line. For the current case there does not seem to be an extreme
departure from normality. Drawing a histogram or density plot of the residuals will also
help to verify the normality assumption. The following code was used.


plt.plot(fit.fittedvalues, fit.resid, '.')
plt.xlabel("fitted values")
plt.ylabel("residuals")
sm.qqplot(fit.resid)

Figure 5.8: Left: residuals against fitted values. Right: a qq plot of the residuals. Neither
shows clear evidence against the model assumptions of constant variance and normality.
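
To see what a qq plot actually displays, the coordinates can be computed directly. The sketch below is a simplified stand-in, not the routine used by statsmodels: the helper name `qq_points`, the plotting positions $(i - 0.5)/n$, and the simulated residuals are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm


def qq_points(residuals):
    """Theoretical and sample quantiles for a normal qq plot (hypothetical helper)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n   # one common plotting-position choice
    return norm.ppf(probs), r


# Hypothetical residuals drawn from a normal distribution.
rng = np.random.default_rng(3)
res = rng.normal(scale=7.8, size=100)
theo, samp = qq_points(res)

# Under normality the points lie close to a straight line, so the
# correlation between the two sets of quantiles is close to 1.
corr = np.corrcoef(theo, samp)[0, 1]
print(corr)
```

A correlation well below 1, or systematic curvature in the plotted points, would suggest a departure from normality.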


5.6.6 Variable Selection 


Among the large number of possible explanatory variables, we wish to select those which 
best explain the observed responses. By eliminating redundant explanatory variables, we 
reduce the statistical error without increasing the approximation error, and thus reduce the 
(expected) generalization risk of the learner. 

In this section, we briefly present two methods for variable selection. They are illus- 
trated on a few variables from the data set birthwt discussed in Section 1.5.3.2. The data 
set contains information on the birth weights (masses) of babies, as well as various char- 
acteristics of the mother, such as whether she smokes, her age, etc. We wish to explain 
the child’s weight at birth using various characteristics of the mother, her family history, 
and her behavior during pregnancy. The response variable is weight at birth (quantitative 
variable bwt, expressed in grams); the explanatory variables are given below. 




The data can be obtained as explained in Section 1.5.3.2, or from statsmodels in the
following way:


bwt = sm.datasets.get_rdataset("birthwt","MASS").data 





Here is some information about the explanatory variables that we will investigate.

age:   mother's age in years
lwt:   mother's weight in lbs
race:  mother's race (1 = white, 2 = black, 3 = other)
smoke: smoking status during pregnancy (0 = no, 1 = yes)
ptl:   no. of previous premature labors
ht:    history of hypertension (0 = no, 1 = yes)
ui:    presence of uterine irritability (0 = no, 1 = yes)
ftv:   no. of physician visits during first trimester
bwt:   birth weight in grams


We can see the structure of the variables via bwt.info(). Check for yourself that all
variables are defined as quantitative (int64). However, the variables race, smoke, ht,
and ui should really be interpreted as qualitative (factors). To fix this, we could redefine
them with the method astype, similar to what we did in Chapter 1. Alternatively, we could
use the C() construction in a statsmodels formula to let the program know that certain
variables are factors. We will use the latter approach.


For binary features it does not matter whether the variables are interpreted as
factorial or numerical, as the numerical and summary results are identical.





We consider the explanatory variables lwt, age, ui, smoke, ht, and two recoded binary
variables ftv1 and ptl1. We define ftv1 = 1 if there was at least one visit to a physician,
and ftv1 = 0 otherwise. Similarly, we define ptl1 = 1 if there is at least one preterm birth
in the family history, and ptl1 = 0 otherwise.


ftv1 = (bwt['ftv']>=1).astype(int)
ptl1 = (bwt['ptl']>=1).astype(int)





5.6.6.1 Forward Selection and Backward Elimination 


The forward selection method is an iterative method for variable selection. In the first
iteration we consider which feature f1 is the most significant in terms of its P-value in the
models bwt~f1, with f1 ∈ {lwt, age, ...}. This feature is then selected into the model. In
the second iteration, the feature f2 that has the smallest P-value in the models bwt~f1+f2
is selected, where f2 ≠ f1, and so on. Usually only features with a P-value of at most 0.05
are selected. The following Python program automates this procedure. Instead of
selecting on the P-value one could select on the AIC or BIC value.
forwardselection.py


import statsmodels.api as sm
from statsmodels.formula.api import ols

bwt = sm.datasets.get_rdataset("birthwt","MASS").data
ftv1 = (bwt['ftv']>=1).astype(int)
ptl1 = (bwt['ptl']>=1).astype(int)

remaining_features = {'lwt', 'age', 'C(ui)', 'smoke',
                      'C(ht)', 'ftv1', 'ptl1'}
selected_features = []
while remaining_features:
    PF = []  # list of (P-value, feature)
    for f in remaining_features:
        temp = selected_features + [f]  # temporary list of features
        formula = 'bwt~' + '+'.join(temp)
        fit = ols(formula, data=bwt).fit()
        pval = fit.pvalues.iloc[-1]  # P-value of the newly added feature
        if pval < 0.05:
            PF.append((pval, f))
    if PF:  # if not empty
        PF.sort(reverse=True)  # largest first, so pop() returns the smallest
        (best_pval, best_f) = PF.pop()
        remaining_features.remove(best_f)
        print('feature {} with P-value = {:.2E}'.
              format(best_f, best_pval))
        selected_features.append(best_f)
    else:
        break


feature C(ui) with P-value = 7.52E-05
feature C(ht) with P-value = 1.08E-02
feature lwt with P-value = 6.01E-03
feature smoke with P-value = 7.27E-03


In backward elimination we start with the complete model (all features included) and
at each step we remove the variable with the highest P-value, as long as that P-value is
not significant (greater than 0.05). We leave it as an exercise to verify that the order in
which the features are removed is: age, ftv1, and ptl1. In this case, forward selection and
backward elimination result in the same model, but this need not be the case in general.
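The backward elimination loop can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it selects on the AIC rather than on P-values (one of the alternatives mentioned above), it uses synthetic data instead of birthwt, and the helper names ols_aic and backward_elimination are our own.

```python
import numpy as np

def ols_aic(X, y):
    """Gaussian AIC (up to an additive constant) of a least-squares fit."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def backward_elimination(X, y, names):
    """Drop one feature at a time for as long as doing so lowers the AIC."""
    n = len(y)
    active = list(range(X.shape[1]))

    def design(cols):
        # model matrix: constant feature plus the chosen columns
        return np.hstack([np.ones((n, 1)), X[:, cols]])

    current = ols_aic(design(active), y)
    while active:
        # AIC of every model obtained by removing one active feature
        trials = [(ols_aic(design([c for c in active if c != j]), y), j)
                  for j in active]
        best_aic, drop = min(trials)
        if best_aic >= current:  # no removal improves the AIC: stop
            break
        active.remove(drop)
        current = best_aic
    return [names[j] for j in active]

# synthetic data: only x0 and x1 influence the response
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
selected = backward_elimination(X, y, ['x0', 'x1', 'x2', 'x3'])
print(selected)
```

The same loop can be adapted to P-values by replacing ols_aic with a fit that returns the largest P-value among the active features.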


This way of model selection has the advantage of being easy to use and of treating the 
question of variable selection in a systematic manner. The main drawback is that variables 
are included or deleted based on purely statistical criteria, without taking into account the 
aim of the study. This usually leads to a model which may be satisfactory from a statistical 
point of view, but in which the variables are not necessarily the most relevant when it comes 
to understanding and interpreting the data in the study. 


Of course, we can choose to investigate any combination of features, not just the ones 
suggested by the above variable selection methods. For example, let us see if the mother’s 
weight, her age, her race, and whether she smokes explain the baby’s birthweight. 





formula = 'bwt~lwt+age+C(race)+smoke'
bwt_model = ols(formula, data=bwt).fit()
print(bwt_model.summary())


                          OLS Regression Results
-----------------------------------------------------------------------------
Dep. Variable:      bwt              R-squared:              0.148
Model:              OLS              Adj. R-squared:         0.125
Method:             Least Squares    F-statistic:            6.373
No. Observations:   189              Prob (F-statistic):     1.76e-05
Df Residuals:       183              Log-Likelihood:         -1498.4
Df Model:           5                AIC:                    3009.
                                     BIC:                    3028.
-----------------------------------------------------------------------------
                    coef   std err        t    P>|t|     [0.025     0.975]
-----------------------------------------------------------------------------
Intercept      2839.4334   321.435    8.834    0.000   2205.239   3473.628
C(race)[T.2]   -510.5015   157.077   -3.250    0.001   -820.416   -200.587
C(race)[T.3]   -398.6439   119.579   -3.334    0.001   -634.575   -162.713
smoke          -401.7205   109.241   -3.677    0.000   -617.254   -186.187
lwt               3.9999     1.738    2.301    0.022      0.571      7.429
age              -1.9478     9.820   -0.198    0.843    -21.323     17.427
-----------------------------------------------------------------------------
Omnibus:            3.916            Durbin-Watson:          0.458
Prob(Omnibus):      0.141            Jarque-Bera (JB):       3.718
Skew:               -0.343           Prob(JB):               0.156
Kurtosis:           3.038            Cond. No.               899.





Given the result of Fisher’s global test, reported as Prob (F-statistic) in the summary
(P-value = $1.76 \times 10^{-5}$), we can conclude that at least one of the explanatory variables is
associated with child weight at birth, after adjusting for the other variables. The individual
Student tests indicate that:


• the mother’s weight is linearly associated with child weight, after adjusting for age,
race, and smoking status (P-value = 0.022). At the same age, race, and smoking
status, an increase of one pound in the mother’s weight corresponds to an increase
of 4 g in the average child weight at birth;

• the age of the mother is not significantly linearly associated with child weight at
birth, when mother weight, race, and smoking status are already taken into account
(P-value = 0.843);

• weight at birth is significantly lower for a child born to a mother who smokes, com-
pared to children born to non-smoking mothers of the same age, race, and weight,
with a P-value of 0.00031 (to see this, inspect bwt_model.pvalues). At the same
age, race, and mother weight, the child’s weight at birth is 401.720 g less for a
smoking mother than for a non-smoking mother;

• regarding the interpretation of the variable race, we note that the first level of this
categorical variable corresponds to white mothers. The estimate of −510.501 g for
C(race)[T.2] represents the difference in the child’s birth weight between black
mothers and white mothers (reference group), and this result is significantly different
from zero (P-value = 0.001) in a model adjusted for the mother’s weight, age, and
smoking status.


5.6.6.2 Interaction 


We can also include interaction terms in the model. Let us see whether there is any inter- 
action effect between smoke and age via the model 


$$\text{Bwt} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{smoke} + \beta_3\,\text{age} \times \text{smoke} + \varepsilon.$$


In Python this can be done as follows (below we have removed some output): 


formula = 'bwt~age*smoke'
bwt_model = ols(formula, data=bwt).fit()
print(bwt_model.summary())


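                          OLS Regression Results
-----------------------------------------------------------------------------
Dep. Variable:      bwt              R-squared:              0.069
Model:              OLS              Adj. R-squared:         0.054
Method:             Least Squares    F-statistic:            4.577
No. Observations:   189              Prob (F-statistic):     0.00407
Df Residuals:       185              Log-Likelihood:         -1506.8
Df Model:           3                AIC:                    3022.
                                     BIC:                    3035.
-----------------------------------------------------------------------------
                 coef    std err
-----------------------------------------------------------------------------
Intercept      2406.1    292.190
smoke           798.2    484.342
age              27.7     12.149
age:smoke       -46.6     20.447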





We observe that the estimate for $\beta_3$ (−46.6) is significantly different from zero (P-value
= 0.024). We therefore conclude that the effect of the mother’s age on the child’s weight
depends on the smoking status of the mother. The results on association between mother
age and child weight must therefore be presented separately for the smoking and the non-
smoking group. For non-smoking mothers (smoke = 0), the mean child weight at birth
increases on average by 27.7 grams for each year of the mother’s age. This is statistically
significant, as can be seen from the 95% confidence interval for the age parameter (which
does not contain zero):


bwt_model.conf_int()


                      0             1
Intercept   1829.605754   2982.510194
age            3.762780     51.699977
smoke       -157.368023   1753.717779
age:smoke    -86.911405     -6.232425





Similarly, for smoking mothers, there seems to be a decrease in birthweight, $\widehat{\beta}_1 + \widehat{\beta}_3 =
27.7 - 46.6 = -18.9$, but this is not statistically significant; see Exercise 6.
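The slope for smokers, $\beta_1 + \beta_3$, can also be read off directly by reversing the coding of the smoking indicator. The following sketch illustrates this on synthetic data (not the birthwt data); the coefficient values used to generate the data and the helper fit are our own assumptions, chosen only to resemble the fitted model above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
age = rng.uniform(15, 45, size=n)
smoke = rng.integers(0, 2, size=n).astype(float)
# synthetic response with an age-by-smoke interaction
y = (2400 + 28 * age + 800 * smoke - 47 * age * smoke
     + rng.normal(scale=700, size=n))

def fit(flag):
    """Least-squares fit of y on [1, age, flag, age*flag]."""
    X = np.column_stack([np.ones(n), age, flag, age * flag])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b = fit(smoke)       # original coding
c = fit(1 - smoke)   # reversed coding: nonsmoke = 1 - smoke

slope_smokers = b[1] + b[3]   # beta_1 + beta_3 in the original model
print(slope_smokers, c[1])    # equals the age coefficient in the recoded model
```

Because both codings span the same column space, the two fits describe the same regression surface; only the parameterization differs. A confidence interval for the age coefficient of the recoded model is therefore a confidence interval for the slope among smokers.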
5.7 Generalized Linear Models


The normal linear model in Section 2.8 deals with continuous response variables, such
as height and crop yield, and continuous or discrete explanatory variables. Given the
feature vectors $\{x_i\}$, the responses $\{Y_i\}$ are independent of each other, and each has a normal
distribution with mean $x_i^\top \beta$, where $x_i^\top$ is the i-th row of the model matrix X. Generalized
linear models allow for arbitrary response distributions, including discrete ones.


Definition 5.2: Generalized Linear Model 


In a generalized linear model (GLM) the expected response for a given feature vec-
tor $x = [x_1, \ldots, x_p]^\top$ is of the form

$$\mathbb{E}[Y \mid X = x] = h(x^\top \beta) \qquad (5.35)$$

for some function h, which is called the activation function. The distribution of
Y (for a given x) may depend on additional dispersion parameters that model the
randomness in the data that is not explained by x.





The inverse of function h is called the link function. As for the linear model, (5.35) is
a model for a single pair (x, Y). Using the model simplification introduced at the end of
Section 5.1, the corresponding model for a whole training set $\mathcal{T} = \{(x_i, Y_i)\}$ is that the $\{x_i\}$
are fixed and that the $\{Y_i\}$ are independent, with each $Y_i$ satisfying (5.35) with $x = x_i$. Writing
$Y = [Y_1, \ldots, Y_n]^\top$ and defining $\mathbf{h}$ as the multivalued function with components h, we have

$$\mathbb{E}_X Y = \mathbf{h}(X\beta),$$

where X is the (model) matrix with rows $x_1^\top, \ldots, x_n^\top$. A common assumption is that
$Y_1, \ldots, Y_n$ come from the same family of distributions, e.g., normal, Bernoulli, or Pois-
son. The central focus is the parameter vector $\beta$, which summarizes how the matrix of
explanatory variables X affects the response vector Y. The class of generalized linear mod-
els can encompass a wide variety of models. Obviously the normal linear model (2.34) is
a generalized linear model, with $\mathbb{E}[Y \mid X = x] = x^\top \beta$, so that h is the identity function. In
this case, $Y_i \sim N(x_i^\top \beta, \sigma^2)$, $i = 1, \ldots, n$, where $\sigma^2$ is a dispersion parameter.


Example 5.10 (Logistic Regression) In a logistic regression or logit model, we as-
sume that the response variables $Y_1, \ldots, Y_n$ are independent and distributed according to
$Y_i \sim \mathrm{Ber}(h(x_i^\top \beta))$, where h here is defined as the cdf of the logistic distribution:

$$h(x) = \frac{1}{1 + e^{-x}}.$$

Large values of $x_i^\top \beta$ thus lead to a high probability that $Y_i = 1$, and small (negative) values
of $x_i^\top \beta$ cause $Y_i$ to be 0 with high probability. Estimation of the parameter vector $\beta$ from
the observed data is not as straightforward as for the ordinary linear model, but can be
accomplished via the minimization of a suitable training loss, as explained below.
As the $\{Y_i\}$ are independent, the pdf of $Y = [Y_1, \ldots, Y_n]^\top$ is

$$g(y \mid \beta, X) = \prod_{i=1}^{n} \left[h(x_i^\top \beta)\right]^{y_i} \left[1 - h(x_i^\top \beta)\right]^{1 - y_i}.$$
Maximizing the log-likelihood $\ln g(y \mid \beta, X)$ with respect to $\beta$ gives the maximum likeli-
hood estimator of $\beta$. In a supervised learning framework, this is equivalent to minimizing:

$$-\frac{1}{n} \ln g(y \mid \beta, X) = -\frac{1}{n} \sum_{i=1}^{n} \ln g(y_i \mid \beta, x_i)
= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln h(x_i^\top \beta) + (1 - y_i) \ln\!\left(1 - h(x_i^\top \beta)\right) \right]. \qquad (5.36)$$


By comparing (5.36) with (4.4), we see that we can interpret (5.36) as the cross-entropy
training loss associated with comparing a true conditional pdf $f(y \mid x)$ with an approxima-
tion pdf $g(y \mid \beta, x)$ via the loss function

$$\mathrm{Loss}(f(y \mid x), g(y \mid \beta, x)) := -\ln g(y \mid \beta, x) = -y \ln h(x^\top \beta) - (1 - y) \ln(1 - h(x^\top \beta)).$$

Minimizing (5.36) in terms of $\beta$ actually constitutes a convex optimization problem. Since
$\ln h(x^\top \beta) = -\ln(1 + e^{-x^\top \beta})$ and $\ln(1 - h(x^\top \beta)) = -x^\top \beta - \ln(1 + e^{-x^\top \beta})$, the cross-entropy
training loss (5.36) can be rewritten as

$$r_\tau(\beta) := \frac{1}{n} \sum_{i=1}^{n} \left[ (1 - y_i)\, x_i^\top \beta + \ln\!\left(1 + e^{-x_i^\top \beta}\right) \right].$$
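As a quick numerical sanity check, the two expressions for the training loss can be compared on arbitrary data. The sketch below is our own illustration: the function names and the synthetic data are assumptions, and np.logaddexp(0, -z) is used to evaluate $\ln(1 + e^{-z})$ in a numerically stable way.

```python
import numpy as np

def h(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_cross_entropy(beta, X, y):
    """Cross-entropy training loss in the form (5.36)."""
    p = h(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss_rewritten(beta, X, y):
    """Rewritten form: mean of (1 - y_i) x_i^T beta + ln(1 + exp(-x_i^T beta))."""
    z = X @ beta
    return np.mean((1 - y) * z + np.logaddexp(0.0, -z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = rng.integers(0, 2, 50).astype(float)
beta = np.array([-0.5, 2.0])

a = loss_cross_entropy(beta, X, y)
b = loss_rewritten(beta, X, y)
print(a, b)  # the two values agree up to rounding error
```

The rewritten form avoids taking the logarithm of a probability that may round to 0 or 1, which is one practical reason to prefer it in code.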
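We leave it as Exercise 7 to show that the gradient $\nabla r_\tau(\beta)$ and Hessian $H(\beta)$ of $r_\tau(\beta)$ are
given by

$$\nabla r_\tau(\beta) = \frac{1}{n} \sum_{i=1}^{n} (\mu_i - y_i)\, x_i \qquad (5.37)$$

and

$$H(\beta) = \frac{1}{n} \sum_{i=1}^{n} \mu_i (1 - \mu_i)\, x_i x_i^\top, \qquad (5.38)$$

respectively, where $\mu_i := h(x_i^\top \beta)$.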
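Formulas (5.37) and (5.38) are easy to check numerically before attempting the proof. The following sketch, written under our own naming assumptions and with synthetic data, compares the gradient with a central finite-difference approximation and verifies that the Hessian has no negative eigenvalues at the chosen point.

```python
import numpy as np

def h(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    z = X @ beta
    return np.mean((1 - y) * z + np.logaddexp(0.0, -z))

def gradient(beta, X, y):
    """Gradient (5.37): (1/n) sum (mu_i - y_i) x_i."""
    mu = h(X @ beta)
    return X.T @ (mu - y) / len(y)

def hessian(beta, X, y):
    """Hessian (5.38): (1/n) sum mu_i (1 - mu_i) x_i x_i^T."""
    mu = h(X @ beta)
    w = mu * (1 - mu)
    return (X.T * w) @ X / len(y)

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = rng.integers(0, 2, n).astype(float)
beta = np.array([0.3, -1.2])

# central finite differences along each coordinate direction
eps = 1e-6
num_grad = np.array([
    (loss(beta + eps * e, X, y) - loss(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])
max_grad_err = np.max(np.abs(num_grad - gradient(beta, X, y)))
min_eig = np.linalg.eigvalsh(hessian(beta, X, y)).min()
print(max_grad_err, min_eig)
```

A check of this kind does not replace the derivation asked for in Exercise 7, but it catches sign and indexing mistakes early.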

Notice that $H(\beta)$ is a positive semidefinite matrix for all values of $\beta$, implying the
convexity of $r_\tau(\beta)$. Consequently, we can find an optimal $\beta$ efficiently; e.g., via Newton’s
method. Specifically, given an initial value $\beta_0$, for $t = 1, 2, \ldots$, iteratively compute

$$\beta_t = \beta_{t-1} - H^{-1}(\beta_{t-1})\, \nabla r_\tau(\beta_{t-1}), \qquad (5.39)$$

until the sequence $\beta_0, \beta_1, \beta_2, \ldots$ is deemed to have converged, using some pre-fixed con-
vergence criterion.

Figure 5.9 shows the outcomes of 100 independent Bernoulli random variables, where
each success probability, $(1 + \exp(-(\beta_0 + \beta_1 x)))^{-1}$, depends on x and $\beta_0 = -3$, $\beta_1 = 10$. The
true logistic curve is also shown (dashed line). The minimum training loss curve (red line)
is obtained via the Newton scheme (5.39), giving estimates $\widehat{\beta}_0 = -2.66$ and $\widehat{\beta}_1 = 10.08$.
The Python code is given below.


Figure 5.9: Logistic regression data (blue dots), fitted curve (red), and true curve (black
dashed).


logreg1d.py


import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import lstsq

n = 100  # sample size
x = (2*np.random.rand(n)-1).reshape(n,1)  # explanatory variables
beta = np.array([-3, 10])
Xmat = np.hstack((np.ones((n,1)), x))
p = 1/(1 + np.exp(-Xmat @ beta))
y = np.random.binomial(1,p,n)  # response variables

# initial guess
betat = lstsq((Xmat.T @ Xmat),Xmat.T @ y, rcond=None)[0]

grad = np.array([2,1])  # gradient (dummy value to enter the loop)

while (np.sum(np.abs(grad)) > 1e-5):  # stopping criterion
    mu = 1/(1+np.exp(-Xmat @ betat))
    # gradient (the common factor 1/n cancels in the Newton step)
    delta = (mu - y).reshape(n,1)
    grad = np.sum(np.multiply(np.hstack((delta,delta)),Xmat), axis=0).T
    # Hessian
    H = Xmat.T @ np.diag(np.multiply(mu,(1-mu))) @ Xmat
    betat = betat - lstsq(H,grad,rcond=None)[0]
    print(betat)

plt.plot(x,y, '.')  # plot data
xx = np.linspace(-1,1,40).reshape(40,1)
XXmat = np.hstack((np.ones((len(xx),1)), xx))
yy = 1/(1 + np.exp(-XXmat @ beta))
plt.plot(xx,yy,'k--')  # true logistic curve
yy = 1/(1 + np.exp(-XXmat @ betat))
plt.plot(xx,yy,'r-')  # fitted logistic curve
plt.show()
Further Reading


An excellent overview of regression is provided in [33] and an accessible mathematical 
treatment of linear regression models can be found in [108]. For extensions to nonlinear 
regression we refer the reader to [7]. A practical introduction to multilevel/hierarchical 
models is given in [47]. For further discussion on regression with discrete responses (clas- 
sification) we refer to Chapter 7 and the further reading therein. On the important question 
of how to handle missing data, the classic reference is [80] (see also [85]) and a modern 
applied reference is [120]. 


Exercises 


1. Following his mentor Francis Galton, the mathematician/statistician Karl Pearson con-
ducted comprehensive studies comparing hereditary traits between members of the same
family. Figure 5.10 depicts the measurements of the heights of 1078 fathers and their
adult sons (one son per father). The data is available from the book’s GitHub site as
pearson.csv.


[Figure 5.10 axes. Horizontal: Height Father (in). Vertical: Height Son (in).]





Figure 5.10: A scatterplot of heights from Pearson’s data. 


(a) Show that sons are on average 1 inch taller than the fathers.


(b) We could try to “explain” the height of the son by taking the height of his father and
adding 1 inch. The prediction line y = x + 1 (red dashed) is given in Figure 5.10. The
black solid line is the fitted regression line. This line has a slope less than 1, and
demonstrates Galton’s “regression” to the average. Find the intercept and slope of the
fitted regression line.


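2. For the simple linear regression model, show that the values for $\widehat{\beta}_1$ and $\widehat{\beta}_0$ that solve the
equations (5.9) are:

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (5.40)$$

$$\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x}, \qquad (5.41)$$

provided that not all $x_i$ are the same.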


3. Edwin Hubble discovered that the universe is expanding. If v is a galaxy’s recession ve-
locity (relative to any other galaxy) and d is its distance (from that same galaxy), Hubble’s
law states that

$$v = H d,$$

where H is known as Hubble’s constant. The following are distance (in millions of light-
years) and velocity (thousands of miles per second) measurements made on five galactic
clusters.


distance |   68   137   315   405   700
velocity |  2.4   4.7  12.0  14.4  26.0


State the regression model and estimate H.


4. The multiple linear regression model (5.6) can be viewed as a first-order approximation
of the general model

$$Y = g(x) + \varepsilon, \qquad (5.42)$$

where $\mathbb{E}\,\varepsilon = 0$, $\mathrm{Var}\,\varepsilon = \sigma^2$, and g(x) is some known or unknown function of a d-
dimensional vector x of explanatory variables. To see this, replace g(x) with its first-order
Taylor approximation around some point $x_0$ and write this as $\beta_0 + x^\top \beta$. Express $\beta_0$ and $\beta$
in terms of g and $x_0$.


5. Table 5.6 shows data from an agricultural experiment where crop yield was measured 
for two levels of pesticide and three levels of fertilizer. There are three responses for each 
combination. 


Table 5.6: Crop yields for pesticide and fertilizer combinations.


                               Fertilizer
Pesticide   Low                Medium             High
No          3.23, 3.20, 3.16   2.99, 2.85, 2.77   5.72, 5.77, 5.62
Yes         6.78, 6.73, 6.79   9.07, 9.09, 8.86   8.12, 8.04, 8.31


(a) Organize the data in standard form, where each row corresponds to a single meas-
urement and the columns correspond to the response variable and the two factor vari-
ables.

(b) Let $Y_{ijk}$ be the response for the k-th replication at level i for factor 1 and level j
for factor 2. To assess which factors best explain the response variable, we use the
ANOVA model

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad (5.43)$$

where $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$. Define $\beta = [\mu, \alpha_1, \alpha_2, \beta_1, \beta_2, \beta_3, \gamma_{11}, \gamma_{12},
\gamma_{13}, \gamma_{21}, \gamma_{22}, \gamma_{23}]^\top$. Give the corresponding 18 × 12 model matrix.

(c) Note that the parameters are linearly dependent in this case. For example, $\alpha_2 = -\alpha_1$
and $\gamma_{13} = -(\gamma_{11} + \gamma_{12})$. To retain only 6 linearly independent variables consider the
6-dimensional parameter vector $\widetilde{\beta} = [\mu, \alpha_1, \beta_1, \beta_2, \gamma_{11}, \gamma_{12}]^\top$. Find the matrix M such
that $M \widetilde{\beta} = \beta$.

(d) Give the model matrix corresponding to $\widetilde{\beta}$.


6. Show that for the birthweight data in Section 5.6.6.2 there is no significant decrease
in birthweight for smoking mothers. [Hint: create a new variable nonsmoke = 1−smoke,
which reverses the encoding for the smoking and non-smoking mothers. Then, the para-
meter $\beta_1 + \beta_3$ in the original model is the same as the parameter $\beta_1$ in the model

$$\text{Bwt} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{nonsmoke} + \beta_3\,\text{age} \times \text{nonsmoke} + \varepsilon.$$

Now find a 95% confidence interval for $\beta_1$ and see if it contains zero.]

7. Prove (5.37) and (5.38).


8. In the Tobit regression model with normally distributed errors, the response is modeled
as:

$$Y_i = \begin{cases} Z_i, & \text{if } u_i < Z_i, \\ u_i, & \text{if } Z_i \leq u_i, \end{cases} \qquad Z \sim N(X\beta, \sigma^2 I_n),$$

where the model matrix X and the thresholds $u_1, \ldots, u_n$ are given. Typically, $u_i = 0$, $i =
1, \ldots, n$. Suppose we wish to estimate $\theta := (\beta, \sigma^2)$ via the Expectation-Maximization
method, similar to the censored data Example 4.2. Let $y = [y_1, \ldots, y_n]^\top$ be the vector
of observed data.


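(a) Show that the likelihood of y is:

$$g(y \mid \theta) = \prod_{i:\, y_i > u_i} \varphi_{\sigma^2}(y_i - x_i^\top \beta) \times \prod_{i:\, y_i = u_i} \Phi\!\left((u_i - x_i^\top \beta)/\sigma\right),$$

where $\Phi$ is the cdf of the N(0, 1) distribution and $\varphi_{\sigma^2}$ the pdf of the $N(0, \sigma^2)$ distribu-
tion.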


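(b) Let $\overline{y}$ and $\underline{y}$ be vectors that collect all $y_i > u_i$ and $y_i = u_i$, respectively. Denote the
corresponding matrix of predictors by $\overline{X}$ and $\underline{X}$, respectively. For each observation
$y_i = u_i$ introduce a latent variable $z_i$ and collect these into a vector z. For the same
indices i collect the corresponding $u_i$ into a vector c. Show that the complete-data
likelihood is given by

$$g(y, z \mid \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{\|\overline{y} - \overline{X}\beta\|^2}{2\sigma^2} - \frac{\|z - \underline{X}\beta\|^2}{2\sigma^2} \right) \mathbb{1}\{z \leq c\}.$$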
(c) For the E-step, show that, for a fixed $\theta$,

$$g(z \mid y, \theta) = \prod_{i} g(z_i \mid y, \theta),$$

where each $g(z_i \mid y, \theta)$ is the pdf of the $N((\underline{X}\beta)_i, \sigma^2)$ distribution, truncated to the in-
terval $(-\infty, c_i]$.


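(d) For the M-step, compute the expectation of the complete log-likelihood

$$-\frac{n}{2} \ln \sigma^2 - \frac{n}{2} \ln(2\pi) - \frac{\|\overline{y} - \overline{X}\beta\|^2}{2\sigma^2} - \frac{\mathbb{E}\|Z - \underline{X}\beta\|^2}{2\sigma^2}.$$

Then, derive the formulas for $\beta$ and $\sigma^2$ that maximize the expectation of the complete
log-likelihood.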


9. Download the data set WomenWage.csv from the book’s website. This data set is a tidied-up
version of the women’s wages data set from [91]. The first column of the data (hours) is
the response variable Y. It shows the hours spent in the labor force by married women in
the 1970s. We want to understand what factors determine the participation rate of women
in the labor force. The predictor variables are:


Table 5.7: Features for the women’s wage data set.


Feature    Description
kidslt6    Number of children younger than 6 years.
kidsge6    Number of children older than 6 years.
age        Age of the married woman.
educ       Number of years of formal education.
exper      Number of years of “work experience”.
nwifeinc   Non-wife income, that is, the income of the husband.
expersq    The square of exper, to capture any nonlinear relationships.


We observe that some of the responses are Y = 0, that is, some women did not particip-
ate in the labor force. For this reason, we model the data using the Tobit regression model,
in which the response Y is given as:

$$Y_i = \begin{cases} Z_i, & \text{if } Z_i > 0, \\ 0, & \text{if } Z_i \leq 0, \end{cases} \qquad Z \sim N(X\beta, \sigma^2 I_n).$$

With $\theta = (\beta, \sigma^2)$, the likelihood of the data $y = [y_1, \ldots, y_n]^\top$ is:

$$g(y \mid \theta) = \prod_{i:\, y_i > 0} \varphi_{\sigma^2}(y_i - x_i^\top \beta) \times \prod_{i:\, y_i = 0} \Phi\!\left((u_i - x_i^\top \beta)/\sigma\right),$$

where $\Phi$ is the standard normal cdf. In Exercise 8, we derived the EM algorithm for max-
imizing the log-likelihood.


(a) Write down the EM algorithm in pseudo code as it applies to this Tobit regression.

(b) Implement the EM algorithm pseudo code in Python. Comment on which factor you
think is important in determining the labor participation rate of women living in the
USA in the 1970s.


10. Let P be a projection matrix. Show that the diagonal elements of P all lie in the interval
[0, 1]. In particular, for $P = XX^+$ in Theorem 5.1, the leverage value $p_i := P_{ii}$ satisfies
$0 \leq p_i \leq 1$ for all i.


11. Consider the linear model $Y = X\beta + \varepsilon$ in (5.8), with X being the n × p model matrix
and $\varepsilon$ having expectation vector 0 and covariance matrix $\sigma^2 I_n$. Suppose that $\widehat{\beta}_{-i}$ is the
least-squares estimate obtained by omitting the i-th observation, $Y_i$; that is,

$$\widehat{\beta}_{-i} = \operatorname*{argmin}_{\beta} \sum_{j \neq i} (Y_j - x_j^\top \beta)^2,$$

where $x_j^\top$ is the j-th row of X. Let $\widehat{Y}_{-i} = x_i^\top \widehat{\beta}_{-i}$ be the corresponding fitted value at $x_i$. Also,
define $B_i$ as the least-squares estimator of $\beta$ based on the response data

$$Y^{(i)} := [Y_1, \ldots, Y_{i-1}, \widehat{Y}_{-i}, Y_{i+1}, \ldots, Y_n]^\top.$$

(a) Prove that $\widehat{\beta}_{-i} = B_i$; that is, the linear model obtained from fitting all responses except
the i-th is the same as the one obtained from fitting the data $Y^{(i)}$.

(b) Use the previous result to verify that

$$Y_i - \widehat{Y}_{-i} = (Y_i - \widehat{Y}_i)/(1 - P_{ii}),$$

where $P = XX^+$ is the projection matrix onto the columns of X. Hence, deduce the
PRESS formula in Theorem 5.1.
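Before proving the identity in part (b), it can be reassuring to see it hold numerically. The sketch below is our own illustration on synthetic data: it compares the leave-one-out prediction errors obtained by refitting n times with the shortcut that uses only the full-data residuals and the leverages.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.pinv(X)      # projection matrix P = X X^+
leverage = np.diag(P)
resid = Y - P @ Y              # full-data residuals Y_i - Yhat_i

# shortcut: (Y_i - Yhat_i) / (1 - P_ii)
loo_shortcut = resid / (1 - leverage)

# brute force: refit without observation i and predict at x_i
loo_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    loo_refit[i] = Y[i] - X[i] @ b

press = np.sum(loo_shortcut ** 2)
print(np.max(np.abs(loo_shortcut - loo_refit)), press)
```

The shortcut needs a single fit, whereas the brute-force version needs n fits; this is the practical content of the PRESS formula.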


12. Take the linear model $Y = X\beta + \varepsilon$, where X is an n × p model matrix, $\mathbb{E}\,\varepsilon = 0$, and
$\mathrm{Cov}(\varepsilon) = \sigma^2 I_n$. Let $P = XX^+$ be the projection matrix onto the columns of X.


(a) Using the properties of the pseudo-inverse (see Definition A.2), show that $PP^\top = P$.


(b) Let $E = Y - \widehat{Y}$ be the (random) vector of residuals, where $\widehat{Y} = PY$. Show that the i-th
residual has a normal distribution with expectation 0 and variance $\sigma^2(1 - P_{ii})$ (that is,
$\sigma^2$ times 1 minus the i-th leverage).


(c) Show that $\sigma^2$ can be unbiasedly estimated via

$$S^2 := \frac{1}{n - p} \|Y - \widehat{Y}\|^2 = \frac{1}{n - p} \|Y - X\widehat{\beta}\|^2. \qquad (5.44)$$

[Hint: use the cyclic property of the trace as in Example 2.3.]


13. Consider a normal linear model $Y = X\beta + \varepsilon$, where X is an n × p model matrix and
$\varepsilon \sim N(0, \sigma^2 I_n)$. Exercise 12 shows that for any such model the i-th standardized residual
$E_i/(\sigma \sqrt{1 - P_{ii}})$ has a standard normal distribution. This motivates the use of the leverage
$P_{ii}$ to assess whether the i-th observation is an outlier depending on the size of the i-th
residual relative to $\sqrt{1 - P_{ii}}$. A more robust approach is to include an estimate for $\sigma$ using
all data except the i-th observation. This gives rise to the studentized residual $T_i$, defined
as

$$T_i := \frac{E_i}{S_{-i} \sqrt{1 - P_{ii}}},$$

where $S_{-i}$ is an estimate of $\sigma$ obtained by fitting all the observations except the i-th and
$E_i = Y_i - \widehat{Y}_i$ is the i-th (random) residual. Exercise 12 shows that we can take, for example,

$$S_{-i}^2 = \frac{1}{n - 1 - p} \|Y_{-i} - X_{-i} \widehat{\beta}_{-i}\|^2, \qquad (5.45)$$

where $X_{-i}$ is the model matrix X with the i-th row removed, as an unbiased estimator of
$\sigma^2$. We wish to compute $S_{-i}^2$ efficiently, using $S^2$ in (5.44), as the latter will typically be
available once we have fitted the linear model. To this end, define $u_i$ as the i-th unit vector
$[0, \ldots, 0, 1, 0, \ldots, 0]^\top$, and let

$$Y^{(i)} := Y - (Y_i - \widehat{Y}_{-i})\, u_i = Y - \frac{E_i}{1 - P_{ii}}\, u_i,$$

where we have used the fact that $Y_i - \widehat{Y}_{-i} = E_i/(1 - P_{ii})$, as derived in the proof of The-
orem 5.1. Now apply Exercise 11 to prove that

$$S_{-i}^2 = \frac{(n - p) S^2 - E_i^2/(1 - P_{ii})}{n - p - 1}.$$
14. Using the notation from Exercises 11–13, Cook’s distance for observation i is defined
as

$$D_i := \frac{\|\widehat{Y} - \widehat{Y}^{(i)}\|^2}{p\, S^2}.$$

It measures the change in the fitted values when the i-th observation is removed, relative to
the residual variance of the model (estimated via $S^2$).

By using similar arguments as those in Exercise 13, show that

$$D_i = \frac{P_{ii}\, E_i^2}{(1 - P_{ii})^2\, p\, S^2}.$$

It follows that there is no need to “omit and refit” the linear model in order to compute
Cook’s distance for the i-th response.
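The closed-form expression for Cook’s distance can be checked against the “omit and refit” definition on a small synthetic example. The sketch below is our own illustration; the data and variable names are assumptions, and $\widehat{Y}^{(i)}$ is computed as $X \widehat{\beta}_{-i}$, in line with Exercise 11.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

P = X @ np.linalg.pinv(X)          # projection matrix
lev = np.diag(P)                   # leverages P_ii
Y_hat = P @ Y
E = Y - Y_hat                      # residuals
S2 = np.sum(E ** 2) / (n - p)      # estimate (5.44)

# closed form: P_ii E_i^2 / ((1 - P_ii)^2 p S^2)
cook_shortcut = lev * E ** 2 / ((1 - lev) ** 2 * p * S2)

# definition: refit without observation i, compare all fitted values
cook_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    cook_refit[i] = np.sum((Y_hat - X @ b) ** 2) / (p * S2)

print(np.max(np.abs(cook_shortcut - cook_refit)))
```

Observations with both a large residual and a high leverage receive the largest values, which is why Cook’s distance is often used as a combined influence measure.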


15. Prove that if we add an additional feature to the general linear model, then $R^2$, the
coefficient of determination, is necessarily non-decreasing in value and hence cannot be
used to compare models with different numbers of predictors.


16. Let $X := [X_1, \ldots, X_n]^\top$ and $\mu := [\mu_1, \ldots, \mu_n]^\top$. In the fundamental Theorem C.9, we
use the fact that if $X_i \sim N(\mu_i, 1)$, $i = 1, \ldots, n$, are independent, then $\|X\|^2$ has (per definition)
a noncentral $\chi^2$ distribution. Show that $\|X\|^2$ has moment generating function

$$\frac{e^{t \|\mu\|^2/(1 - 2t)}}{(1 - 2t)^{n/2}}, \qquad t < 1/2,$$

and so the distribution of $\|X\|^2$ depends on $\mu$ only through the norm $\|\mu\|$.
17. Carry out a logistic regression analysis on a (partial) wine data set classification prob-
lem. The data can be loaded using the following code.


from sklearn import datasets
import numpy as np
data = datasets.load_wine()
X = data.data[:, [9,10]]
y = np.array(data.target==1,dtype=np.uint)
X = np.append(np.ones(len(X)).reshape(-1,1),X,axis=1)





The model matrix has three features, including the constant feature. Instead of using
Newton’s method (5.39) to estimate $\beta$, implement a simple gradient descent procedure

$$\beta_t = \beta_{t-1} - \alpha \nabla r_\tau(\beta_{t-1}),$$


with learning rate $\alpha = 0.0001$, and run it for $10^6$ steps. Your procedure should deliver three
coefficients; one for the intercept and the rest for the explanatory variables. Solve the same
problem using the Logit method of statsmodels.api and compare the results.


18. Consider again Example 5.10, where we train the learner via the Newton iteration
(5.39). If $X^\top := [x_1, \ldots, x_n]$ defines the matrix of predictors and $\mu_t := \mathbf{h}(X\beta_t)$, then the
gradient (5.37) and Hessian (5.38) for Newton’s method can be written as:

$$\nabla r_\tau(\beta_t) = \frac{1}{n} X^\top (\mu_t - y) \quad \text{and} \quad H(\beta_t) = \frac{1}{n} X^\top D_t X,$$

where $D_t := \mathrm{diag}(\mu_t \odot (1 - \mu_t))$ is a diagonal matrix. Show that the Newton iteration (5.39)
can be written as the iterative reweighted least-squares method:

$$\beta_t = \operatorname*{argmin}_{\beta}\, (\widetilde{y}_{t-1} - X\beta)^\top D_{t-1} (\widetilde{y}_{t-1} - X\beta),$$

where $\widetilde{y}_{t-1} := X\beta_{t-1} + D_{t-1}^{-1}(y - \mu_{t-1})$ is the so-called adjusted response. [Hint: use the fact
that $(M^\top M)^{-1} M^\top z$ is the minimizer of $\|M\beta - z\|^2$.]


19. In multi-output linear regression, the response variable is a real-valued vector of di-
mension, say, m. Similar to (5.8), the model can be written in matrix notation:

$$Y = XB + \begin{bmatrix} \varepsilon_1^\top \\ \vdots \\ \varepsilon_n^\top \end{bmatrix},$$

where:

• Y is an n × m matrix of n independent responses (stored as row vectors of length m);
• X is the usual n × p model matrix;
• B is a p × m matrix of model parameters;
• $\varepsilon_1, \ldots, \varepsilon_n \in \mathbb{R}^m$ are independent error terms with $\mathbb{E}\,\varepsilon = 0$ and $\mathbb{E}\,\varepsilon\varepsilon^\top = \Sigma$.
We wish to learn the matrix parameters B and $\Sigma$ from the training set {Y, X}. To this end,
consider minimizing the training loss:

$$\frac{1}{n} \operatorname{tr}\!\left( (Y - XB)\, \Sigma^{-1} (Y - XB)^\top \right),$$

where tr(·) is the trace of a matrix.


(a) Show that the minimizer of the training loss, denoted $\widehat{B}$, satisfies the normal equa-
tions:

$$X^\top X \widehat{B} = X^\top Y.$$

(b) Noting that

$$(Y - XB)^\top (Y - XB) = \sum_{i=1}^{n} \varepsilon_i \varepsilon_i^\top,$$

explain why

$$\widehat{\Sigma} := \frac{(Y - X\widehat{B})^\top (Y - X\widehat{B})}{n}$$

is a method-of-moments estimator of $\Sigma$, just like the one given in (5.10).


CHAPTER 6 





REGULARIZATION AND KERNEL 
METHODS 


The purpose of this chapter is to familiarize the reader with two central concepts 
in modern data science and machine learning: regularization and kernel methods. Reg- 
ularization provides a natural way to guard against overfitting and kernel methods of- 
fer a broad generalization of linear models. Here, we discuss regularized regression 
(ridge, lasso) as a bridge to the fundamentals of kernel methods. We introduce repro- 
ducing kernel Hilbert spaces and show that selecting the best prediction function in 
such spaces is in fact a finite-dimensional optimization problem. Applications to spline 
fitting, Gaussian process regression, and kernel PCA are given. 


6.1 Introduction 


In this chapter we return to the supervised learning setting of Chapter 5 (regression) and ex-
pand its scope. Given training data $\tau = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we wish to find a prediction
function (the learner) $g_\tau$ that minimizes the (squared-error) training loss

$$\ell_\tau(g) = \frac{1}{n} \sum_{i=1}^{n} (y_i - g(x_i))^2$$

within a class of functions $\mathcal{G}$. As noted in Chapter 2, if $\mathcal{G}$ is the set of all possible functions
then choosing any function g with the property that $g(x_i) = y_i$ for all i will give zero training
loss, but will likely have poor generalization performance (that is, suffer from overfitting).

Recall from Theorem 2.1 that the best possible prediction function (over all g) for
the squared-error risk $\mathbb{E}(Y - g(X))^2$ is given by $g^*(x) = \mathbb{E}[Y \mid X = x]$. The class $\mathcal{G}$ should
be simple enough to permit theoretical understanding and analysis but, at the same time,
rich enough to contain the optimal function $g^*$ (or a function close to $g^*$). This ideal can
be realized by taking $\mathcal{G}$ to be a Hilbert space (i.e., a complete inner product space) of
functions; see Appendix A.7.

Many of the classes of functions that we have encountered so far are in fact Hilbert
spaces. In particular, the set $\mathcal{G}$ of linear functions on $\mathbb{R}^p$ is a Hilbert space. To see this,
identify with each element $\beta \in \mathbb{R}^p$ the linear function $g_\beta : x \mapsto x^\top \beta$ and define the inner
product on $\mathcal{G}$ as $\langle g_\beta, g_\gamma \rangle := \beta^\top \gamma$. In this way, $\mathcal{G}$ behaves in exactly the same way as (is
isomorphic to) the space $\mathbb{R}^p$ equipped with the Euclidean inner product (dot product). The
latter is a Hilbert space, because it is complete with respect to the Euclidean norm. See
Exercise 12 for a further discussion.

Let us now turn to our “running” polynomial regression Example 2.1, where the feature
vector $x = [1, u, u^2, \ldots, u^{p-1}]^\top =: \phi(u)$ is itself a vector-valued function of another feature
u. Then, the space of functions $h_\beta : u \mapsto \phi(u)^\top \beta$ is a Hilbert space, through the identifica-
tion $h_\beta \equiv \beta$. In fact, this is true for any feature mapping $\phi : u \mapsto [\phi_1(u), \ldots, \phi_p(u)]^\top$.
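To make the identification $h_\beta \equiv \beta$ concrete, the polynomial feature map and the induced inner product can be written out in a few lines. This is a minimal sketch with our own function names (phi, h, inner); it relies on np.vander with increasing=True to produce the powers $1, u, \ldots, u^{p-1}$.

```python
import numpy as np

def phi(u, p):
    """Polynomial feature map u -> [1, u, u^2, ..., u^(p-1)], one row per input."""
    return np.vander(np.atleast_1d(u), N=p, increasing=True)

def h(beta, u):
    """The function h_beta evaluated at u: phi(u)^T beta."""
    return phi(u, len(beta)) @ beta

def inner(beta, gamma):
    """Inner product of h_beta and h_gamma, defined through their coefficients."""
    return float(np.dot(beta, gamma))

beta = np.array([1.0, 0.0, -1.0, 0.5])    # h_beta(u) = 1 - u^2 + 0.5 u^3
gamma = np.array([0.0, 2.0, 0.0, 1.0])    # h_gamma(u) = 2 u + u^3

features = phi(2.0, 4)        # [[1., 2., 4., 8.]]
value = h(beta, 2.0)[0]       # 1 - 4 + 4 = 1.0
ip = inner(beta, gamma)       # 0 + 0 + 0 + 0.5 = 0.5
print(features, value, ip)
```

All computations with the functions $h_\beta$ thus reduce to computations with the coefficient vectors, which is what makes the space behave like $\mathbb{R}^p$.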

This can be further generalized by considering feature maps u +> k,, where each xk, 
is a real-valued function v + k,(v) on the feature space. As we shall soon see (in Sec- 
tion 6.3), functions of the form u + °°, B;k,,(u) live in a Hilbert space of functions called 
a reproducing kernel Hilbert space (RKHS). In Section 6.3 we introduce the notion of a 
RKHS formally, give specific examples, including the linear and Gaussian kernels, and de- 
rive various useful properties, the most important of which is the representer Theorem 6.6. 
Applications of such spaces include the smoothing splines (Section 6.6), Gaussian pro- 
cess regression (Section 6.7), kernel PCA (Section 6.8), and support vector machines for 
classification (Section 7.7). 

The RKHS formalism also makes it easier to treat the important topic of regularization. 
The aim of regularization is to improve the predictive performance of the best learner in 
some class of functions G by adding a penalty term to the training loss that penalizes 
learners that tend to overfit the data. In the next section we introduce the main ideas behind 
regularization, which then segues into a discussion of kernel methods in the subsequent 
sections. 


6.2 Regularization 


Let $\mathcal{G}$ be the Hilbert space of functions over which we search for the minimizer, $g_\tau$, of
the training loss $\ell_\tau(g)$. Often, the Hilbert space $\mathcal{G}$ is rich enough so that we can find a
learner $g_\tau$ within $\mathcal{G}$ such that the training loss is zero or close to zero. Consequently, if the
space of functions $\mathcal{G}$ is sufficiently rich, we run the risk of overfitting. One way to avoid
overfitting is to restrict attention to a subset of the space $\mathcal{G}$ by introducing a non-negative
functional $J : \mathcal{G} \to \mathbb{R}$, which penalizes complex models (functions). In particular, we want
to find functions $g \in \mathcal{G}$ such that $J(g) < c$ for some “regularization” constant $c > 0$. Thus
we can formulate the quintessential supervised learning problem as:

$$\min \{\ell_\tau(g) : g \in \mathcal{G},\ J(g) < c\}, \tag{6.1}$$

the solution (argmin) of which is our learner. When this optimization problem is convex, it
can be solved by first obtaining the Lagrangian dual function

$$\mathcal{L}^*(\lambda) := \min_{g \in \mathcal{G}} \{\ell_\tau(g) + \lambda (J(g) - c)\},$$

and then maximizing $\mathcal{L}^*(\lambda)$ with respect to $\lambda \geq 0$; see Section B.2.3.

In order to introduce the overall ideas of kernel methods and regularization, we will 
proceed by exploring (6.1) in the special case of ridge regression, with the following run- 
ning example. 







■ Example 6.1 (Ridge Regression) Ridge regression is simply linear regression with a
squared-norm penalty functional (also called a regularization function, or regularizer).
Suppose we have a training set $\tau = \{(x_i, y_i), i = 1, \ldots, n\}$, with each $x_i \in \mathbb{R}^p$, and we use a
squared-norm penalty with regularization parameter $\gamma > 0$. Then, the problem is to solve

$$\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2 + \gamma \|g\|^2, \tag{6.2}$$

where $\mathcal{G}$ is the Hilbert space of linear functions on $\mathbb{R}^p$. As explained in Section 6.1, we
can identify each $g \in \mathcal{G}$ with a vector $\beta \in \mathbb{R}^p$ and, consequently,
$\|g\|^2 = \langle \beta, \beta \rangle = \|\beta\|^2$. The above functional optimization problem is thus equivalent to
the parametric optimization problem

$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \gamma \|\beta\|^2, \tag{6.3}$$

which, in the notation of Chapter 5, further simplifies to

$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \|y - X\beta\|^2 + \gamma \|\beta\|^2. \tag{6.4}$$

In other words, the solution to (6.2) is of the form $x \mapsto x^\top \beta^*$, where $\beta^*$ solves (6.3) (or
equivalently (6.4)). Observe that as $\gamma \to \infty$, the regularization term becomes dominant and
consequently the optimal $g$ becomes identically zero.

The optimization problem in (6.4) is convex, and by multiplying by the constant $n/2$
and setting the gradient equal to zero, we obtain

$$X^\top (X\beta - y) + n\gamma\beta = 0. \tag{6.5}$$

If $\gamma = 0$ these are simply the normal equations, albeit written in a slightly different form.
If the matrix $X^\top X + n\gamma I_p$ is invertible (which is the case for any $\gamma > 0$; see Exercise 13),
then the solution to these modified normal equations is

$$\hat{\beta} = (X^\top X + n\gamma I_p)^{-1} X^\top y.$$
■
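
The closed-form solution above is easy to check numerically. The following is a minimal sketch with NumPy, assuming synthetic data; the helper name `ridge_fit`, the data, and the values of $\gamma$ are illustrative choices and not part of the example. It solves the modified normal equations (6.5) and prints the norm of the left-hand side of (6.5) at the solution.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Solve (X^T X + n*gamma*I_p) beta = X^T y, i.e., the modified normal equations (6.5)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)                 # hypothetical synthetic data
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

for gamma in (0.0, 0.1, 10.0, 1e6):
    beta = ridge_fit(X, y, gamma)
    grad = X.T @ (X @ beta - y) + n * gamma * beta   # left-hand side of (6.5)
    print(gamma, np.round(beta, 4), np.linalg.norm(grad))
```

For $\gamma = 0$ the result coincides with the ordinary least-squares estimate, and for very large $\gamma$ the coefficients shrink toward zero, in line with the remark that the optimal $g$ becomes identically zero as $\gamma \to \infty$.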


When using regularization with respect to some Hilbert space $\mathcal{G}$, it is sometimes useful
to decompose $\mathcal{G}$ into two orthogonal subspaces, $\mathcal{H}$ and $\mathcal{C}$ say, such that every $g \in \mathcal{G}$ can
be uniquely written as $g = h + c$, with $h \in \mathcal{H}$, $c \in \mathcal{C}$, and $\langle h, c \rangle = 0$. Such a $\mathcal{G}$ is said to
be the direct sum of $\mathcal{C}$ and $\mathcal{H}$, and we write $\mathcal{G} = \mathcal{H} \oplus \mathcal{C}$. Decompositions of this form
become useful when functions in $\mathcal{H}$ are penalized but functions in $\mathcal{C}$ are not. We illustrate
this decomposition with the ridge regression example where one of the features is a constant
term, which we do not wish to penalize.


■ Example 6.2 (Ridge Regression (cont.)) Suppose one of the features in Example 6.1
is the constant 1, which we do not wish to penalize. The reason for this is to ensure that
when $\gamma \to \infty$, the optimal $g$ becomes the “constant” model, $g(x) = \beta_0$, rather than the
“zero” model, $g(x) = 0$. Let us alter the notation slightly by considering the feature vectors
to be of the form $\tilde{x} = [1, x^\top]^\top$, where $x = [x_1, \ldots, x_p]^\top$. We thus have $p + 1$ features,
rather than $p$. Let $\mathcal{G}$ be the space of linear functions of $\tilde{x}$. Each linear function $g$ of $\tilde{x}$ can
be written as $g : \tilde{x} \mapsto \beta_0 + x^\top \beta$, which is the sum of the constant function
$c : \tilde{x} \mapsto \beta_0$ and $h : \tilde{x} \mapsto x^\top \beta$. Moreover, the two functions are orthogonal with respect to
the inner product on $\mathcal{G}$: $\langle c, h \rangle = [\beta_0, 0^\top][0, \beta^\top]^\top = 0$, where $0$ is a column vector of
zeros.

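As subspaces of $\mathcal{G}$, both $\mathcal{C}$ and $\mathcal{H}$ are again Hilbert spaces, and their inner products and
norms follow directly from the inner product on $\mathcal{G}$. For example, each function
$h : \tilde{x} \mapsto x^\top \beta$ in $\mathcal{H}$ has norm $\|h\|_{\mathcal{H}} = \|\beta\|$, and the constant function
$c : \tilde{x} \mapsto \beta_0$ in $\mathcal{C}$ has norm $|\beta_0|$.

The modification of the regularized optimization problem (6.2) where the constant term
is not penalized can now be written as

$$\min_{g \in \mathcal{H} \oplus \mathcal{C}} \frac{1}{n} \sum_{i=1}^n (y_i - g(\tilde{x}_i))^2 + \gamma \|g\|_{\mathcal{H}}^2, \tag{6.6}$$

which further simplifies to

$$\min_{\beta_0, \beta} \frac{1}{n} \|y - \beta_0 \mathbf{1} - X\beta\|^2 + \gamma \|\beta\|^2, \tag{6.7}$$

where $\mathbf{1}$ is the $n \times 1$ vector of 1s. Observe that, in this case, as $\gamma \to \infty$ the optimal $g$
tends to the sample mean $\bar{y}$ of the $\{y_i\}$; that is, we obtain the “default” regression model,
without explanatory variables. Again, this is a convex optimization problem, and the solution
follows from

$$X^\top (\beta_0 \mathbf{1} + X\beta - y) + n\gamma\beta = 0, \tag{6.8}$$

with

$$n \beta_0 = \mathbf{1}^\top (y - X\beta). \tag{6.9}$$

Substituting (6.9) into (6.8) results in solving for $\beta$ from

$$(X^\top X - n^{-1} X^\top \mathbf{1}\mathbf{1}^\top X + n\gamma I_p)\,\beta = (X^\top - n^{-1} X^\top \mathbf{1}\mathbf{1}^\top)\, y, \tag{6.10}$$

and determining $\beta_0$ from (6.9).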

As a precursor to the kernel methods in the following sections, let us assume that $n \geq p$
and that $X$ has full (column) rank $p$. Then any vector $\beta \in \mathbb{R}^p$ can be written as a linear
combination of the feature vectors $\{x_i\}$; that is, as a linear combination of the columns of
the matrix $X^\top$. In particular, let $\beta = X^\top \alpha$, where $\alpha = [\alpha_1, \ldots, \alpha_n]^\top \in \mathbb{R}^n$. In this
case (6.10) reduces to

$$(XX^\top - n^{-1} \mathbf{1}\mathbf{1}^\top XX^\top + n\gamma I_n)\,\alpha = (I_n - n^{-1} \mathbf{1}\mathbf{1}^\top)\, y.$$

Assuming invertibility of $(XX^\top - n^{-1} \mathbf{1}\mathbf{1}^\top XX^\top + n\gamma I_n)$, we have the solution

$$\hat{\alpha} = (XX^\top - n^{-1} \mathbf{1}\mathbf{1}^\top XX^\top + n\gamma I_n)^{-1} (I_n - n^{-1} \mathbf{1}\mathbf{1}^\top)\, y,$$

which depends on the training feature vectors $\{x_i\}$ only through the $n \times n$ matrix of inner
products: $XX^\top = [\langle x_i, x_j \rangle]$. This matrix is called the Gram matrix of the $\{x_i\}$. From (6.9),
the solution for the constant term is $\hat{\beta}_0 = n^{-1} \mathbf{1}^\top (y - XX^\top \hat{\alpha})$. It follows that the
learner is a linear combination of inner products $\{\langle x_i, x \rangle\}$ plus a constant:

$$g_\tau(\tilde{x}) = \hat{\beta}_0 + x^\top X^\top \hat{\alpha} = \hat{\beta}_0 + \sum_{i=1}^n \hat{\alpha}_i \langle x_i, x \rangle,$$

where the coefficients $\hat{\beta}_0$ and $\hat{\alpha}_i$ only depend on the inner products $\{\langle x_i, x_j \rangle\}$. We
will see shortly that the representer Theorem 6.6 generalizes this result to a broad class of
regularized optimization problems. ■
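
To make the equivalence between the two formulations concrete, the sketch below solves the same ridge problem twice on synthetic data: once in the “primal” form (6.10) and (6.9), and once in the “dual” form, where the data enter only through the Gram matrix $XX^\top$. The data, the value of $\gamma$, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                 # hypothetical synthetic data
n, p, gamma = 40, 4, 0.5
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n         # centering matrix I_n - n^{-1} 1 1^T

# Primal: solve (6.10) for beta, then (6.9) for beta_0.
beta = np.linalg.solve(X.T @ C @ X + n * gamma * np.eye(p), X.T @ C @ y)
beta0 = one @ (y - X @ beta) / n

# Dual: solve the n x n system for alpha; only K = X X^T is needed.
K = X @ X.T
alpha = np.linalg.solve(C @ K + n * gamma * np.eye(n), C @ y)
beta0_dual = one @ (y - K @ alpha) / n

def predict_dual(x_new):
    """Evaluate g(x) = beta_0 + sum_i alpha_i <x_i, x>."""
    return beta0_dual + (X @ x_new) @ alpha

x_new = rng.normal(size=p)
print(beta0 + x_new @ beta, predict_dual(x_new))   # the two predictions agree
```

The dual system is $n \times n$ rather than $p \times p$, so for $n > p$ it is more expensive here; its value lies in the fact that the inner products $\langle x_i, x \rangle$ can later be replaced by kernel evaluations.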


We illustrate in Figure 6.1 how the solutions of the ridge regression problems appearing
in Examples 6.1 and 6.2 are qualitatively affected by the regularization parameter $\gamma$ for a
simple linear regression model. The data was generated from the model
$y_i = -1.5 + 0.5 x_i + \varepsilon_i$, $i = 1, \ldots, 100$, where each $x_i$ is drawn independently and
uniformly from the interval $[0, 10]$ and each $\varepsilon_i$ is drawn independently from the standard
normal distribution.


























Figure 6.1: Ridge regression solutions for a simple linear regression problem. Each panel
shows contours of the loss function (log scale) and the effect of the regularization parameter
$\gamma \in \{0.1, 1, 10\}$, appearing in (6.4) and (6.7). Top row: both terms are penalized. Bottom
row: only the non-constant term is penalized. Penalized (plus) and unpenalized (diamond)
solutions are shown in each case.


The contours are those of the squared-error loss (actually the logarithm thereof), which
is minimized with respect to the model parameters $\beta_0$ and $\beta_1$. The diamonds all represent
the same minimizer of this loss. The plusses show each minimizer $[\beta_0^*, \beta_1^*]^\top$ of the
regularized minimization problems (6.4) and (6.7) for three choices of the regularization
parameter $\gamma$. For the top three panels the regularization involves both $\beta_0$ and $\beta_1$, through
the squared norm $\beta_0^2 + \beta_1^2$. The circles show the points that have the same squared norm as
the optimal solution. For the bottom three panels only $\beta_1$ is regularized; there, horizontal
lines indicate vectors $[\beta_0, \beta_1]^\top$ for which $|\beta_1| = |\beta_1^*|$.

The problem of ridge regression discussed in Example 6.2 boils down to solving a
problem of the form in (6.7), involving a squared 2-norm penalty $\|\beta\|^2$. A natural question
to ask is whether we can replace the squared 2-norm penalty by a different penalty term.
Replacing it with a 1-norm gives the lasso (least absolute shrinkage and selection
operator). The lasso equivalent of the ridge regression problem (6.7) is thus:

$$\min_{\beta_0, \beta} \frac{1}{n} \|y - \beta_0 \mathbf{1} - X\beta\|^2 + \gamma \|\beta\|_1, \tag{6.11}$$

where $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$.

This is again a convex optimization problem. Unlike ridge regression, the lasso generally
does not have an explicit solution, and so numerical methods must be used to solve it.
Note that the problem (6.11) is of the form

$$\min_{x, z}\ f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c, \tag{6.12}$$

with $x := [\beta_0, \beta^\top]^\top$, $z := \beta$, $A := [0_p, I_p]$, $B := -I_p$, and $c := 0_p$ (vector of zeros), and
convex functions $f(x) := \frac{1}{n} \|y - [\mathbf{1}_n, X]\,x\|^2$ and $g(z) := \gamma \|z\|_1$. There exist efficient
algorithms for solving such problems, including the alternating direction method of
multipliers (ADMM) [17]. We refer to the Appendix B.29 for details on this algorithm.
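
As an illustration of how the splitting (6.12) can be used, here is a minimal sketch of ADMM in its scaled form applied to (6.11). It is not the implementation referred to above: the function names, the penalty parameter `rho`, the fixed iteration count, and the synthetic data are all assumptions made for this sketch. The $x$-update solves a linear system, the $z$-update is componentwise soft-thresholding, and the scaled dual variable `u` accumulates the constraint residual $Ax - z$.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise minimizer of t*|z| + 0.5*(z - v)^2."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(X, y, gamma, rho=1.0, n_iter=2000):
    """Scaled-form ADMM for (6.11) with the splitting (6.12); rho, n_iter are illustrative."""
    n, p = X.shape
    W = np.column_stack([np.ones(n), X])               # [1_n, X]
    A = np.hstack([np.zeros((p, 1)), np.eye(p)])       # A = [0_p, I_p], so A x = beta
    M = (2.0 / n) * W.T @ W + rho * A.T @ A            # system matrix of the x-update
    z = np.zeros(p)
    u = np.zeros(p)                                    # scaled dual variable
    for _ in range(n_iter):
        x = np.linalg.solve(M, (2.0 / n) * W.T @ y + rho * A.T @ (z - u))
        z = soft_threshold(A @ x + u, gamma / rho)
        u = u + A @ x - z
    return x[0], z                                     # intercept and (sparse) coefficients

rng = np.random.default_rng(2)                         # hypothetical synthetic data
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + rng.normal(size=n)
beta0, beta = lasso_admm(X, y, gamma=0.2)
print(beta0, beta)
```

Because the coefficients are returned from the soft-thresholding step, components that the penalty removes are exactly zero rather than merely small.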

We repeat the examples from Figure 6.1, but now using lasso regression and taking 
the square roots of the previous regularization parameters. The results are displayed in 
Figure 6.2. 









































Figure 6.2: Lasso regression solutions. Compare with Figure 6.1.

One advantage of using the lasso regularization is that the resulting optimal parameter
vector often has several components that are exactly 0. For example, in the top middle
and right panels of Figure 6.2, the optimal solution lies exactly at a corner point of the
square $\{[\beta_0, \beta_1]^\top : |\beta_0| + |\beta_1| = |\beta_0^*| + |\beta_1^*|\}$; in this case $\beta_1^* = 0$. For statistical
models with many parameters, the lasso can provide a methodology for model selection.
Namely, as the regularization parameter increases (or, equivalently, as the $L_1$ norm of the
optimal solution decreases), the solution vector will have fewer and fewer non-zero
parameters. By plotting the values of the parameters for each $\gamma$ or $L_1$ norm one obtains the
so-called regularization paths (also called homotopy paths or coefficient profiles) for the
variables. Inspection of such paths may help assess which of the model parameters are
relevant to explain the variability in the observed responses $\{y_i\}$.


■ Example 6.3 (Regularization Paths) Figure 6.3 shows the regularization paths for
$p = 60$ coefficients from a multiple linear regression model

$$Y_i = \sum_{j=1}^{60} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, \ldots, 150,$$

where $\beta_j = 1$ for $j = 1, \ldots, 10$ and $\beta_j = 0$ for $j = 11, \ldots, 60$. The error terms $\{\varepsilon_i\}$ are
independent and standard normal. The explanatory variables $\{x_{ij}\}$ were independently
generated from a standard normal distribution. As is clear from the figure, the estimates of
the 10 non-zero coefficients are first selected, as the $L_1$ norm of the solutions increases. By
the time the $L_1$ norm reaches around 4, all 10 variables for which $\beta_j = 1$ have been correctly
identified and the remaining 50 parameters are estimated as exactly 0. Only after the $L_1$
norm reaches around 8 will these “spurious” parameters be estimated to be non-zero. For
this example, the regularization parameter $\gamma$ varied from $10^{-4}$ to 10.




















Figure 6.3: Regularization paths for lasso regression solutions as a function of the $L_1$ norm
of the solutions.


6.3 Reproducing Kernel Hilbert Spaces


In this section, we formalize the idea outlined at the end of Section 6.1 of extending finite
dimensional feature maps to those that are functions by introducing a special type of
Hilbert space of functions known as a reproducing kernel Hilbert space (RKHS). Although
the theory extends naturally to Hilbert spaces of complex-valued functions, we restrict
attention to Hilbert spaces of real-valued functions here.

To evaluate the loss of a learner $g$ in some class of functions $\mathcal{G}$, we do not need to
explicitly construct $g$; rather, it is only required that we can evaluate $g$ at all the feature
vectors $x_1, \ldots, x_n$ of the training set. A defining property of an RKHS is that function
evaluation at a point $x$ can be performed by simply taking the inner product of $g$ with some
feature function $\kappa_x$ associated with $x$. We will see that this property becomes particularly
useful in light of the representer theorem (see Section 6.5), which states that the learner $g$
itself can be represented as a linear combination of the set of feature functions
$\{\kappa_{x_i}, i = 1, \ldots, n\}$. Consequently, we can evaluate a learner $g$ at the feature vectors $\{x_i\}$
by taking linear combinations of terms of the form
$\kappa(x_i, x_j) = \langle \kappa_{x_i}, \kappa_{x_j} \rangle_{\mathcal{G}}$. Collecting these inner products into a matrix
$K = [\kappa(x_i, x_j), i, j = 1, \ldots, n]$ (the Gram matrix of the $\{\kappa_{x_i}\}$), we will see that the
feature vectors $\{x_i\}$ only enter the loss minimization problem through $K$.


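Definition 6.1: Reproducing Kernel Hilbert Space


For a non-empty set $\mathcal{X}$, a Hilbert space $\mathcal{G}$ of functions $g : \mathcal{X} \to \mathbb{R}$ with inner product
$\langle \cdot, \cdot \rangle_{\mathcal{G}}$ is called a reproducing kernel Hilbert space (RKHS) with reproducing kernel
$\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if:


1. for every $x \in \mathcal{X}$, $\kappa_x := \kappa(x, \cdot)$ is in $\mathcal{G}$,


2. $\kappa(x, x) < \infty$ for all $x \in \mathcal{X}$,


3. for every $x \in \mathcal{X}$ and $g \in \mathcal{G}$, $g(x) = \langle g, \kappa_x \rangle_{\mathcal{G}}$.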





The reproducing kernel of a Hilbert space of functions, if it exists, is unique; see
Exercise 2. The main (third) condition in Definition 6.1 is known as the reproducing property.
This property allows us to evaluate any function $g \in \mathcal{G}$ at a point $x \in \mathcal{X}$ by taking the inner
product of $g$ and $\kappa_x$; as such, $\kappa_x$ is called the representer of evaluation. Further, by taking
$g = \kappa_{x'}$ and applying the reproducing property, we have
$\langle \kappa_{x'}, \kappa_x \rangle_{\mathcal{G}} = \kappa(x', x)$, and so by symmetry of the inner product it follows that
$\kappa(x, x') = \kappa(x', x)$. As a consequence, reproducing kernels are necessarily symmetric
functions. Moreover, a reproducing kernel $\kappa$ is a positive semidefinite function, meaning that
for every $n \geq 1$ and every choice of $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$, it holds that

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i\, \kappa(x_i, x_j)\, \alpha_j \geq 0. \tag{6.13}$$

In other words, every Gram matrix $K$ associated with $\kappa$ is a positive semidefinite matrix;
that is, $\alpha^\top K \alpha \geq 0$ for all $\alpha$. The proof is addressed in Exercise 1.

The following theorem gives an alternative characterization of an RKHS. The proof
uses the Riesz representation Theorem A.17. Also note that in the theorem below we could
have replaced the word “bounded” with “continuous”, as the two are equivalent for linear
functionals; see Theorem A.16.


Theorem 6.1: Continuous Evaluation Functionals Characterize an RKHS





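Proof: Note that, since evaluation functionals $\delta_x$ are linear operators, showing boundedness
is equivalent to showing continuity. Given an RKHS with reproducing kernel $\kappa$, suppose
that we have a sequence $g_n \in \mathcal{G}$ converging to $g \in \mathcal{G}$, that is $\|g_n - g\|_{\mathcal{G}} \to 0$. We apply
the Cauchy–Schwarz inequality (Theorem A.15) and the reproducing property of $\kappa$ to find
that for every $x \in \mathcal{X}$ and any $n$:

$$|\delta_x g_n - \delta_x g| = |g_n(x) - g(x)| = |\langle g_n - g, \kappa_x \rangle_{\mathcal{G}}| \leq \|g_n - g\|_{\mathcal{G}}\, \|\kappa_x\|_{\mathcal{G}} = \|g_n - g\|_{\mathcal{G}} \sqrt{\langle \kappa_x, \kappa_x \rangle_{\mathcal{G}}} = \|g_n - g\|_{\mathcal{G}} \sqrt{\kappa(x, x)}.$$

Noting that $\kappa(x, x) < \infty$ by definition for every $x \in \mathcal{X}$, and that $\|g_n - g\|_{\mathcal{G}} \to 0$ as
$n \to \infty$, we have shown continuity of $\delta_x$, that is $|\delta_x g_n - \delta_x g| \to 0$ as $n \to \infty$ for every
$x \in \mathcal{X}$.

Conversely, suppose that evaluation functionals are bounded. Then from the Riesz
representation Theorem A.17, there exists some $g_{\delta_x} \in \mathcal{G}$ such that
$\delta_x g = \langle g, g_{\delta_x} \rangle_{\mathcal{G}}$ for all $g \in \mathcal{G}$ (the representer of evaluation). If we define
$\kappa(x, x') = g_{\delta_x}(x')$ for all $x, x' \in \mathcal{X}$, then $\kappa_x := \kappa(x, \cdot) = g_{\delta_x}$ is an element of $\mathcal{G}$ for
every $x \in \mathcal{X}$ and $\langle g, \kappa_x \rangle_{\mathcal{G}} = \delta_x g = g(x)$, so that the reproducing property in
Definition 6.1 is verified. □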


The fact that an RKHS has continuous evaluation functionals means that if two functions
$g, h \in \mathcal{G}$ are “close” with respect to $\|\cdot\|_{\mathcal{G}}$, then their evaluations $g(x), h(x)$ are close
for every $x \in \mathcal{X}$. Formally, convergence in $\|\cdot\|_{\mathcal{G}}$ norm implies pointwise convergence for
all $x \in \mathcal{X}$.

The following theorem shows that any function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can serve as a
reproducing kernel as long as it is finite, symmetric, and positive semidefinite. The
corresponding (unique!) RKHS $\mathcal{G}$ is the completion of the set of all functions of the form
$\sum_{i=1}^n \alpha_i \kappa_{x_i}$, where $\alpha_i \in \mathbb{R}$ for all $i = 1, \ldots, n$.


Theorem 6.2: Moore–Aronszajn





Proof: (Sketch) As the proof of uniqueness is treated in Exercise 2, the objective is to
prove existence. The idea is to construct a pre-RKHS $\mathcal{G}_0$ from the given function $\kappa$ that
has the essential structure and then to extend $\mathcal{G}_0$ to an RKHS $\mathcal{G}$.

In particular, define $\mathcal{G}_0$ as the set of finite linear combinations of functions $\kappa_x$, $x \in \mathcal{X}$:

$$\mathcal{G}_0 := \Big\{ g = \sum_{i=1}^n \alpha_i \kappa_{x_i} \ :\ x_1, \ldots, x_n \in \mathcal{X},\ \alpha_1, \ldots, \alpha_n \in \mathbb{R},\ n \in \mathbb{N} \Big\}.$$

Define on $\mathcal{G}_0$ the following inner product:

$$\langle f, g \rangle_{\mathcal{G}_0} := \Big\langle \sum_{i=1}^n \alpha_i \kappa_{x_i},\ \sum_{j=1}^m \beta_j \kappa_{x'_j} \Big\rangle_{\mathcal{G}_0} := \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j\, \kappa(x_i, x'_j).$$

Then $\mathcal{G}_0$ is an inner product space. In fact, $\mathcal{G}_0$ has the essential structure we require,
namely that (i) evaluation functionals are bounded/continuous (Exercise 4) and (ii) Cauchy
sequences in $\mathcal{G}_0$ that converge pointwise also converge in norm (see Exercise 5).

We then enlarge $\mathcal{G}_0$ to the set $\mathcal{G}$ of all functions $g : \mathcal{X} \to \mathbb{R}$ for which there exists a
Cauchy sequence in $\mathcal{G}_0$ converging pointwise to $g$ and define an inner product on $\mathcal{G}$ as the
limit

$$\langle f, g \rangle_{\mathcal{G}} := \lim_{n \to \infty} \langle f_n, g_n \rangle_{\mathcal{G}_0}, \tag{6.14}$$

where $f_n \to f$ and $g_n \to g$. To show that $\mathcal{G}$ is an RKHS it remains to be shown that (1) this
inner product is well defined; (2) evaluation functionals remain bounded; and (3) the space
$\mathcal{G}$ is complete. A detailed proof is established in Exercises 6 and 7. □
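
The structure of the pre-RKHS in the proof sketch can be explored numerically. The sketch below is an illustration only: it assumes the Gaussian kernel of Section 6.4.2 on the real line, and hypothetical points and coefficients. It builds one function $f = \sum_i \alpha_i \kappa_{x_i}$, evaluates the inner product $\langle f, \kappa_x \rangle$ from the double-sum definition (which reproduces $f(x)$), and prints the Cauchy–Schwarz bound $|f(x)| \leq \|f\| \sqrt{\kappa(x, x)}$ used in the proof of Theorem 6.1.

```python
import numpy as np

def kappa(x, xp, sigma=1.0):
    """Gaussian kernel on the real line (an assumed choice; any valid kernel would do)."""
    return np.exp(-0.5 * (x - xp) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
pts = rng.uniform(-2, 2, size=6)          # hypothetical points x_1, ..., x_n
alpha = rng.normal(size=6)                # coefficients of f = sum_i alpha_i kappa_{x_i}
K = kappa(pts[:, None], pts[None, :])     # Gram matrix [kappa(x_i, x_j)]

def f(x):
    """Evaluate f(x) = sum_i alpha_i kappa(x_i, x)."""
    return alpha @ kappa(pts, x)

norm_f = np.sqrt(alpha @ K @ alpha)       # ||f|| induced by the inner product on G_0

# <f, kappa_x>: kappa_x is a one-term combination (coefficient 1 at the point x),
# so the double sum reduces to sum_i alpha_i * 1 * kappa(x_i, x).
x = 0.3
inner_f_kx = sum(a * 1.0 * kappa(xi, x) for a, xi in zip(alpha, pts))
print(inner_f_kx, f(x))                                  # reproducing property
print(abs(f(x)), "<=", norm_f * np.sqrt(kappa(x, x)))    # bounded evaluation
```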


6.4 Construction of Reproducing Kernels 


In this section we describe various ways to construct a reproducing kernel
$\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for some feature space $\mathcal{X}$. Recall that $\kappa$ needs to be a finite,
symmetric, and positive semidefinite function (that is, it satisfies (6.13)). In view of
Theorem 6.2, specifying the space $\mathcal{X}$ and a reproducing kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
corresponds to uniquely specifying an RKHS.


6.4.1 Reproducing Kernels via Feature Mapping 


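Perhaps the most fundamental way to construct a reproducing kernel $\kappa$ is via a feature
map $\phi : \mathcal{X} \to \mathbb{R}^p$. We define $\kappa(x, x') := \langle \phi(x), \phi(x') \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the
Euclidean inner product. The function is clearly finite and symmetric. To verify that $\kappa$ is
positive semidefinite, let $\Phi$ be the matrix with rows $\phi(x_1)^\top, \ldots, \phi(x_n)^\top$ and let
$\alpha = [\alpha_1, \ldots, \alpha_n]^\top \in \mathbb{R}^n$. Then,

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i\, \kappa(x_i, x_j)\, \alpha_j = \sum_{i=1}^n \sum_{j=1}^n \alpha_i\, \phi(x_i)^\top \phi(x_j)\, \alpha_j = \alpha^\top \Phi \Phi^\top \alpha = \|\Phi^\top \alpha\|^2 \geq 0.$$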
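
The identity above can be checked directly. The following minimal sketch assumes the polynomial feature map of Example 2.1 with $p = 4$ and randomly drawn points; these choices are illustrative only. It compares the quadratic form $\alpha^\top K \alpha$ with $\|\Phi^\top \alpha\|^2$ and prints the smallest eigenvalue of the Gram matrix.

```python
import numpy as np

def phi(u, p=4):
    """Polynomial feature map u -> [1, u, ..., u^(p-1)], applied to each entry of u."""
    return u[:, None] ** np.arange(p)

rng = np.random.default_rng(4)
u = rng.uniform(-1, 1, size=8)       # hypothetical feature values
Phi = phi(u)                         # rows are phi(x_i)^T, shape (8, 4)
K = Phi @ Phi.T                      # Gram matrix [kappa(x_i, x_j)]
a = rng.normal(size=8)

print(a @ K @ a, np.linalg.norm(Phi.T @ a) ** 2)   # equal up to rounding
print(np.linalg.eigvalsh(K).min())                 # nonnegative up to rounding
```

Since $K = \Phi \Phi^\top$ with $\Phi$ of size $8 \times 4$, the Gram matrix has rank at most 4, so it is positive semidefinite but not positive definite.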


■ Example 6.4 (Linear Kernel) Taking the identity feature map $\phi(x) = x$ on
$\mathcal{X} = \mathbb{R}^p$ gives the linear kernel

$$\kappa(x, x') = \langle x, x' \rangle = x^\top x'.$$

As can be seen from the proof of Theorem 6.2, the RKHS of functions corresponding to
the linear kernel is the space of linear functions on $\mathbb{R}^p$. This space is isomorphic to $\mathbb{R}^p$
itself, as discussed in the introduction (see also Exercise 12). ■


It is natural to wonder whether a given kernel function corresponds uniquely to a feature
map. The answer is no, as we shall see by way of example.

■ Example 6.5 (Feature Maps and Kernel Functions) Let $\mathcal{X} = \mathbb{R}$ and consider feature
maps $\phi_1 : \mathcal{X} \to \mathbb{R}$ and $\phi_2 : \mathcal{X} \to \mathbb{R}^2$, with $\phi_1(x) := x$ and $\phi_2(x) := [x, x]^\top / \sqrt{2}$. Then

$$\kappa_{\phi_1}(x, x') = \langle \phi_1(x), \phi_1(x') \rangle = x x',$$

but also

$$\kappa_{\phi_2}(x, x') = \langle \phi_2(x), \phi_2(x') \rangle = x x'.$$

Thus, we arrive at the same kernel function defined for the same underlying set $\mathcal{X}$ via two
different feature maps. ■


6.4.2 Kernels from Characteristic Functions


Another way to construct reproducing kernels on $\mathcal{X} = \mathbb{R}^p$ makes use of the properties of
characteristic functions. In particular, we have the following result. We leave its proof as
Exercise 10.


Theorem 6.3: Reproducing Kernel from a Characteristic Function

■ Example 6.6 (Gaussian Kernel) The multivariate normal distribution with mean vector
$0$ and covariance matrix $b^2 I_p$ is clearly symmetric around the origin. Its characteristic
function is

$$\psi(t) = \exp\left(-\tfrac{1}{2} b^2 \|t\|^2\right), \quad t \in \mathbb{R}^p.$$

Taking $b^2 = 1/\sigma^2$, this gives the popular Gaussian kernel on $\mathbb{R}^p$:

$$\kappa(x, x') = \exp\left(-\frac{1}{2} \frac{\|x - x'\|^2}{\sigma^2}\right). \tag{6.15}$$

The parameter $\sigma$ is sometimes called the bandwidth. Note that in the machine learning
literature, the Gaussian kernel is sometimes referred to as “the” radial basis function (rbf)
kernel.¹

From the proof of Theorem 6.2, we see that the RKHS $\mathcal{G}$ determined by the Gaussian
kernel $\kappa$ is the space of pointwise limits of functions of the form

$$g(x) = \sum_{i=1}^n \alpha_i \exp\left(-\frac{1}{2} \frac{\|x - x_i\|^2}{\sigma^2}\right).$$

We can think of each point $x_i$ having a feature $\kappa_{x_i}$ that is a scaled multivariate Gaussian pdf
centered at $x_i$. ■
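
A minimal sketch of how the Gram matrix of the Gaussian kernel (6.15) can be computed, and how the bandwidth $\sigma$ affects it, is given below. The function name `gaussian_gram`, the random points, and the bandwidth values are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Matrix [kappa(x_i, y_j)] for the Gaussian kernel (6.15); rows of X and Y are points."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # squared distances
    return np.exp(-0.5 * sq / sigma ** 2)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 2))              # hypothetical points in R^2
for sigma in (0.1, 1.0, 10.0):
    K = gaussian_gram(X, X, sigma)
    evals = np.linalg.eigvalsh(K)         # sorted in ascending order
    print(sigma, evals[0], evals[-1])     # smallest and largest eigenvalue
```

For a small bandwidth the Gram matrix is close to the identity, whereas for a large bandwidth it approaches the all-ones matrix, so that its smallest eigenvalues approach zero.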





¹The term radial basis function is sometimes used more generally to mean kernels of the
form $\kappa(x, x') = f(\|x - x'\|)$ for some function $f : \mathbb{R} \to \mathbb{R}$.

■ Example 6.7 (Sinc Kernel) The characteristic function of a Uniform[−1, 1] random
variable (which is symmetric around 0) is $\psi(t) = \mathrm{sinc}(t) := \sin(t)/t$, so
$\kappa(x, x') = \mathrm{sinc}(x - x')$ is a valid kernel. ■


Inspired by kernel density estimation (Section 4.4), we may be tempted to use the pdf 
of a random variable that is symmetric about the origin to construct a reproducing kernel. 
However, doing so will not work in general, as the next example illustrates. 


■ Example 6.8 (Uniform pdf Does not Construct a Valid Reproducing Kernel) Take
the function $\psi(t) = \frac{1}{2} \mathbb{1}\{|t| \leq 1\}$, which is the pdf of $X \sim \mathrm{Uniform}[-1, 1]$.
Unfortunately, the function $\kappa(x, x') = \psi(x - x')$ is not positive semidefinite, as can be seen
for example by constructing the matrix $A = [\kappa(t_i, t_j), i, j = 1, 2, 3]$ for the points $t_1 = 0$,
$t_2 = 0.75$, and $t_3 = 1.5$ as follows:

$$A = \begin{pmatrix} \psi(0) & \psi(-0.75) & \psi(-1.5) \\ \psi(0.75) & \psi(0) & \psi(-0.75) \\ \psi(1.5) & \psi(0.75) & \psi(0) \end{pmatrix} = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0 & 0.5 & 0.5 \end{pmatrix}.$$

The eigenvalues of $A$ are $\{1/2 - \sqrt{1/2},\ 1/2,\ 1/2 + \sqrt{1/2}\} \approx \{-0.2071, 0.5, 1.2071\}$ and
so by Theorem A.9, $A$ is not a positive semidefinite matrix, since it has a negative eigenvalue.
Consequently, $\kappa$ is not a valid reproducing kernel. ■
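
The eigenvalue computation in Example 6.8 can be reproduced in a few lines. As a contrast, the sketch below also builds the corresponding matrix for the sinc kernel of Example 6.7 at the same three points; the variable names are illustrative.

```python
import numpy as np

t = np.array([0.0, 0.75, 1.5])
D = t[:, None] - t[None, :]                           # pairwise differences t_i - t_j

A_unif = 0.5 * (np.abs(D) <= 1)                       # psi = pdf of Uniform[-1, 1]
safe = np.where(D == 0, 1.0, D)                       # avoid division by zero on the diagonal
A_sinc = np.where(D == 0, 1.0, np.sin(safe) / safe)   # psi(t) = sin(t)/t, with psi(0) = 1

print(np.linalg.eigvalsh(A_unif))   # contains a negative eigenvalue
print(np.linalg.eigvalsh(A_sinc))   # all eigenvalues are nonnegative
```

Note that `np.sinc` is deliberately not used here, since NumPy defines it as the normalized function $\sin(\pi t)/(\pi t)$.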


One of the reasons why the Gaussian kernel (6.15) is popular is that it enjoys the
universal approximation property [88]: the space of functions spanned by the Gaussian kernel
is dense in the space of continuous functions with support $\mathcal{Z} \subset \mathbb{R}^p$. Naturally, this is a
desirable property especially if there is little prior knowledge about the properties of $g^*$.
However, note that every function $g$ in the RKHS $\mathcal{G}$ associated with a Gaussian kernel $\kappa$ is
infinitely differentiable. Moreover, a Gaussian RKHS does not contain non-zero constant
functions. Indeed, if $A \subset \mathcal{Z}$ is non-empty and open, then the only function of the form
$g(x) = c\, \mathbb{1}\{x \in A\}$ contained in $\mathcal{G}$ is the zero function ($c = 0$).

Consequently, if it is known that $g$ is differentiable only to a certain order, one may
prefer the Matérn kernel with parameters $\nu, \sigma > 0$:

$$\kappa_\nu(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\, \|x - x'\| / \sigma\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, \|x - x'\| / \sigma\right), \tag{6.16}$$

which gives functions that are (weakly) differentiable to order $\lfloor \nu \rfloor$ (but not necessarily to
order $\lceil \nu \rceil$). Here, $K_\nu$ denotes the modified Bessel function of the second kind; see (4.49).
The particular form of the Matérn kernel appearing in (6.16) ensures that
$\lim_{\nu \to \infty} \kappa_\nu(x, x') = \kappa(x, x')$, where $\kappa$ is the Gaussian kernel appearing in (6.15).
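
For half-integer values of $\nu$ the Bessel function in (6.16) reduces to an exponential times a polynomial, which gives well-known closed forms. The sketch below implements only the three cases $\nu \in \{1/2, 3/2, 5/2\}$ as a function of the distance $r = \|x - x'\|$; it is an illustration and not a general implementation of (6.16), and the function names are assumptions.

```python
import math

def matern(r, nu, sigma=1.0):
    """Matern kernel (6.16) at distance r, closed forms for nu in {0.5, 1.5, 2.5} only."""
    s = math.sqrt(2 * nu) * r / sigma
    if nu == 0.5:
        return math.exp(-s)
    if nu == 1.5:
        return (1 + s) * math.exp(-s)
    if nu == 2.5:
        return (1 + s + s * s / 3) * math.exp(-s)
    raise ValueError("closed form implemented only for nu in {0.5, 1.5, 2.5}")

def gaussian(r, sigma=1.0):
    """Gaussian kernel (6.15) at distance r."""
    return math.exp(-0.5 * r * r / sigma ** 2)

for nu in (0.5, 1.5, 2.5):
    print(nu, matern(1.0, nu), gaussian(1.0))
```

At $r = 1$ and $\sigma = 1$ the printed Matérn values increase with $\nu$ toward the Gaussian value, in agreement with the limit stated above.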

We remark that Sobolev spaces are closely related to the Matérn kernel. Up to constants
(which scale the unit ball in the space), in dimension $p$ and for a parameter $s > p/2$, these
spaces can be identified with $\psi(t) = \frac{2^{1-s}}{\Gamma(s)} \|t\|^{s - p/2} K_{p/2 - s}(\|t\|)$, which in turn can
be viewed as the characteristic function corresponding to the (radially symmetric)
multivariate Student's $t$ distribution with $s$ degrees of freedom: that is, with pdf
$f(x) \propto (1 + \|x\|^2)^{-s}$.


6.4.3 Reproducing Kernels Using Orthonormal Features


We have seen in Sections 6.4.1 and 6.4.2 how to construct reproducing kernels from feature
maps and characteristic functions. Another way to construct kernels on a space $\mathcal{X}$ is to work
directly from the function class $L^2(\mathcal{X}; \mu)$; that is, the set of square-integrable² functions
on $\mathcal{X}$ with respect to $\mu$; see also Definition A.4. For simplicity, in what follows, we will
consider $\mu$ to be the Lebesgue measure, and will simply write $L^2(\mathcal{X})$ rather than
$L^2(\mathcal{X}; \mu)$. We will also assume that $\mathcal{X} \subseteq \mathbb{R}^p$.

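Let $\{\xi_1, \xi_2, \ldots\}$ be an orthonormal basis of $L^2(\mathcal{X})$ and let $c_1, c_2, \ldots$ be a sequence of
positive numbers. As discussed in Section 6.4.1, the kernel corresponding to a feature map
$\phi : \mathcal{X} \to \mathbb{R}^p$ is $\kappa(x, x') = \phi(x)^\top \phi(x') = \sum_{i=1}^p \phi_i(x)\, \phi_i(x')$. Now consider a (possibly
infinite) sequence of feature functions $\phi_i = c_i\, \xi_i$, $i = 1, 2, \ldots$ and define

$$\kappa(x, x') = \sum_{i \geq 1} \phi_i(x)\, \phi_i(x') = \sum_{i \geq 1} \lambda_i\, \xi_i(x)\, \xi_i(x'), \tag{6.17}$$

where $\lambda_i = c_i^2$, $i = 1, 2, \ldots$. This is well-defined as long as $\sum_{i \geq 1} \lambda_i < \infty$, which we
assume from now on. Let $\mathcal{H}$ be the linear space of functions of the form
$f = \sum_{i \geq 1} \alpha_i \xi_i$, where $\sum_{i \geq 1} \alpha_i^2 / \lambda_i < \infty$. As every function $f \in L^2(\mathcal{X})$ can be
represented as $f = \sum_{i \geq 1} \langle f, \xi_i \rangle\, \xi_i$, we see that $\mathcal{H}$ is a linear subspace of $L^2(\mathcal{X})$. On
$\mathcal{H}$ define the inner product

$$\langle f, g \rangle_{\mathcal{H}} := \sum_{i \geq 1} \frac{\langle f, \xi_i \rangle \langle g, \xi_i \rangle}{\lambda_i}.$$

With this inner product, the squared norm of $f = \sum_{i \geq 1} \alpha_i \xi_i$ is
$\|f\|_{\mathcal{H}}^2 = \sum_{i \geq 1} \alpha_i^2 / \lambda_i < \infty$. We show that $\mathcal{H}$ is actually an RKHS with kernel $\kappa$ by
verifying the conditions of Definition 6.1. First,

$$\kappa_x = \sum_{i \geq 1} \lambda_i\, \xi_i(x)\, \xi_i \in \mathcal{H},$$

as $\sum_{i \geq 1} \lambda_i < \infty$ by assumption, and so $\kappa$ is finite. Second, the reproducing property
holds. Namely, let $f = \sum_{i \geq 1} \alpha_i \xi_i$. Then,

$$\langle \kappa_x, f \rangle_{\mathcal{H}} = \sum_{i \geq 1} \frac{\langle \kappa_x, \xi_i \rangle \langle f, \xi_i \rangle}{\lambda_i} = \sum_{i \geq 1} \frac{\lambda_i\, \xi_i(x)\, \alpha_i}{\lambda_i} = \sum_{i \geq 1} \alpha_i\, \xi_i(x) = f(x).$$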
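
As an illustration of the construction (6.17), the sketch below takes $\mathcal{X} = [0, 1]$, the cosine basis $\xi_0 = 1$, $\xi_i(x) = \sqrt{2} \cos(i \pi x)$, and the summable weights $\lambda_0 = 1$, $\lambda_i = 1/i^2$. These choices, the truncation at 50 terms, and the indexing from 0 rather than 1 are all assumptions made for this example. It checks orthonormality by the midpoint rule, verifies the reproducing property for a function with three non-zero coefficients, and builds a Gram matrix.

```python
import numpy as np

def xi(i, x):
    """Cosine basis on [0, 1]: xi_0 = 1 and xi_i(x) = sqrt(2) cos(i pi x) for i >= 1."""
    return np.ones_like(x) if i == 0 else np.sqrt(2.0) * np.cos(i * np.pi * x)

N = 50                                                     # truncation level (assumed)
lam = np.array([1.0] + [1.0 / i ** 2 for i in range(1, N)])

def kappa(x, xp):
    """Truncation of the series (6.17) to its first N terms."""
    return sum(l * xi(i, x) * xi(i, xp) for i, l in enumerate(lam))

# Orthonormality of the first four basis functions (midpoint rule on [0, 1]).
m = 1000
mid = (np.arange(m) + 0.5) / m
B = np.array([xi(i, mid) for i in range(4)])
G = B @ B.T / m                                            # approximately the identity

# Reproducing property for f = sum_i alpha_i xi_i with three non-zero coefficients.
alpha = np.zeros(N)
alpha[:3] = [0.5, -1.0, 2.0]
x0 = np.array(0.3)
xi_x0 = np.array([xi(i, x0) for i in range(N)])
inner = np.sum((lam * xi_x0) * alpha / lam)                # <kappa_x0, f>_H
f_x0 = np.sum(alpha * xi_x0)                               # f(x0)
print(inner, f_x0)

grid = np.linspace(0, 1, 12)
K = kappa(grid[:, None], grid[None, :])                    # Gram matrix on a grid
print(np.linalg.eigvalsh(K).min())                         # nonnegative up to rounding
```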
The discussion above demonstrates that kernels can be constructed via (6.17). In fact,
(under mild conditions) any given reproducing kernel $\kappa$ can be written in the form (6.17),
where this series representation enjoys desirable convergence properties. This result is
known as Mercer's theorem, and is given below. We leave the full proof including the
precise conditions to, e.g., [40], but the main idea is that a reproducing kernel $\kappa$ can be
thought of as a generalization of a positive semidefinite matrix $K$, and can also be written
in spectral form (see also Section A.6.5). In particular, by Theorem A.9, we can write
$K = V D V^\top$, where $V$ is a matrix of orthonormal eigenvectors $[v_\ell]$ and $D$ the diagonal
matrix of the (positive) eigenvalues $[\lambda_\ell]$; that is,

$$K(i, j) = \sum_{\ell \geq 1} \lambda_\ell\, v_\ell(i)\, v_\ell(j).$$

In (6.18) below, $x, x'$ play the role of $i, j$, and $\xi_\ell$ plays the role of $v_\ell$.

²A function $f : \mathcal{X} \to \mathbb{R}$ is said to be square-integrable if $\int f^2(x)\, \mu(\mathrm{d}x) < \infty$, where
$\mu$ is a measure on $\mathcal{X}$.


Theorem 6.4: Mercer 





Theorem 6.4 holds if (i) the kernel $\kappa$ is continuous on $\mathcal{X} \times \mathcal{X}$, (ii) the function
$\tilde{\kappa}(x) := \kappa(x, x)$ defined for $x \in \mathcal{X}$ is integrable. Extensions of Theorem 6.4 to more
general spaces $\mathcal{X}$ and measures $\mu$ hold; see, e.g., [115] or [40].

The key importance of Theorem 6.4 lies in the fact that the series representation (6.18)
converges absolutely and uniformly on $\mathcal{X} \times \mathcal{X}$. The uniform convergence is a much
stronger condition than pointwise convergence, and means for instance that properties of the
sequence of partial sums, such as continuity and integrability, are transferred to the limit.


■ Example 6.9 (Mercer) Suppose $\mathcal{X} = [-1, 1]$ and the kernel is $\kappa(x, x') = 1 + x x'$, which
corresponds to the RKHS $\mathcal{G}$ of affine functions from $\mathcal{X} \to \mathbb{R}$. To find the (eigenvalue,
eigenfunction) pairs for the integral operator appearing in Theorem 6.4, we need to find
numbers $\{\lambda_\ell\}$ and orthonormal functions $\{\xi_\ell(x)\}$ that solve

$$\int_{-1}^{1} (1 + x x')\, \xi_\ell(x')\, \mathrm{d}x' = \lambda_\ell\, \xi_\ell(x), \quad \text{for all } x \in [-1, 1].$$

Consider first a constant function $\xi_1(x) = c$. Then, for all $x \in [-1, 1]$, we have that
$2c = \lambda_1 c$, and the normalization condition requires that $\int_{-1}^{1} c^2\, \mathrm{d}x = 1$. Together, these
give $\lambda_1 = 2$ and $c = \pm 1/\sqrt{2}$. Next, consider an affine function $\xi_2(x) = a + bx$.
Orthogonality requires that

$$\int_{-1}^{1} c\,(a + bx)\, \mathrm{d}x = 0,$$

which implies $a = 0$ (since $c \neq 0$). Moreover, the normalization condition then requires

$$\int_{-1}^{1} b^2 x^2\, \mathrm{d}x = 1,$$

or, equivalently, $2b^2/3 = 1$, implying $b = \pm\sqrt{3/2}$. Finally, the integral equation reads

$$\int_{-1}^{1} (1 + x x')\, b x'\, \mathrm{d}x' = \lambda_2\, b x \iff \frac{2 b x}{3} = \lambda_2\, b x,$$

implying that $\lambda_2 = 2/3$. We take the positive solutions (i.e., $c > 0$ and $b > 0$), and note that

$$\lambda_1\, \xi_1(x)\, \xi_1(x') + \lambda_2\, \xi_2(x)\, \xi_2(x') = 2\, \frac{1}{\sqrt{2}} \frac{1}{\sqrt{2}} + \frac{2}{3}\, \frac{\sqrt{3}}{\sqrt{2}}\, x\, \frac{\sqrt{3}}{\sqrt{2}}\, x' = 1 + x x' = \kappa(x, x'),$$

and so we have found the decomposition appearing in (6.18). As an aside, observe that $\xi_1$
and $\xi_2$ are orthonormal versions of the first two Legendre polynomials. The corresponding
feature map can be explicitly identified as $\phi_1(x) = \sqrt{\lambda_1}\, \xi_1(x) = 1$ and
$\phi_2(x) = \sqrt{\lambda_2}\, \xi_2(x) = x$. ■
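
The eigenvalues found in Example 6.9 can be approximated numerically by discretizing the integral operator. The sketch below uses the midpoint rule on a uniform grid of $m = 200$ cells, which is an assumed discretization and not part of the example: the eigenvalues of $h K$, with $h$ the cell width and $K$ the kernel matrix on the grid, approximate those of the operator.

```python
import numpy as np

m = 200
h = 2.0 / m
x = -1.0 + h * (np.arange(m) + 0.5)        # midpoints of a uniform grid on [-1, 1]
K = 1.0 + np.outer(x, x)                   # kappa(x, x') = 1 + x x' on the grid

# Midpoint-rule discretization of the integral operator: eigenvalues of h*K.
evals = np.linalg.eigvalsh(h * K)[::-1]    # sorted in descending order
print(evals[:3])                           # approximately 2, 2/3, 0

# Check the decomposition lambda_1 xi_1 xi_1 + lambda_2 xi_2 xi_2 on the grid.
xi1 = np.full(m, 1 / np.sqrt(2))
xi2 = np.sqrt(1.5) * x
recon = 2.0 * np.outer(xi1, xi1) + (2.0 / 3.0) * np.outer(xi2, xi2)
print(np.abs(recon - K).max())
```

Only two eigenvalues are (numerically) non-zero, reflecting the fact that the RKHS of affine functions is two-dimensional.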


6.4.4 Kernels from Kernels 


The following theorem lists some useful properties for constructing reproducing kernels 
from existing reproducing kernels. 


Theorem 6.5: Rules for Constructing Kernels from Other Kernels 





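Proof: For Rules 1, 2, and 3 it is easy to verify that the resulting function is finite,
symmetric, and positive semidefinite, and so is a valid reproducing kernel by Theorem 6.2.
For example, for Rule 1 we have $\sum_{i=1}^n \sum_{j=1}^n \alpha_i\, \kappa(y_i, y_j)\, \alpha_j \geq 0$ for every choice of
$\{\alpha_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n \subset \mathbb{R}^p$, since $\kappa$ is a reproducing kernel. In particular, it holds true
for $y_i = \phi(x_i)$, $i = 1, \ldots, n$. Rule 4 is easy to show for kernels $\kappa_1, \kappa_2$ that admit a
representation of the form (6.17), since

$$\begin{aligned}
\kappa_1(x, x')\, \kappa_2(x, x') &= \Big(\sum_{i \geq 1} \phi_i^{(1)}(x)\, \phi_i^{(1)}(x')\Big) \Big(\sum_{j \geq 1} \phi_j^{(2)}(x)\, \phi_j^{(2)}(x')\Big) \\
&= \sum_{i, j \geq 1} \phi_i^{(1)}(x)\, \phi_j^{(2)}(x)\, \phi_i^{(1)}(x')\, \phi_j^{(2)}(x') \\
&= \sum_{k \geq 1} \phi_k(x)\, \phi_k(x') =: \kappa(x, x'),
\end{aligned}$$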
showing that $\kappa = \kappa_1 \kappa_2$ also admits a representation of the form (6.17), where the new
(possibly infinite) sequence of features $(\phi_k)$ is identified in a one-to-one way with the
sequence $(\phi_i^{(1)} \phi_j^{(2)})$. We leave the proof of Rule 5 as an exercise (Exercise 8). □

■ Example 6.10 (Polynomial Kernel) Consider $x, x' \in \mathbb{R}^2$ with

$$\kappa(x, x') = (1 + \langle x, x' \rangle)^2,$$

where $\langle x, x' \rangle = x^\top x'$. This is an example of a polynomial kernel. Combining the fact that
sums and products of kernels are again kernels (Rules 3 and 4 of Theorem 6.5), we find that,
since $\langle x, x' \rangle$ and the constant function 1 are kernels, so are $1 + \langle x, x' \rangle$ and
$(1 + \langle x, x' \rangle)^2$. By writing

$$\begin{aligned}
\kappa(x, x') &= (1 + x_1 x_1' + x_2 x_2')^2 \\
&= 1 + 2 x_1 x_1' + 2 x_2 x_2' + 2 x_1 x_2 x_1' x_2' + (x_1 x_1')^2 + (x_2 x_2')^2,
\end{aligned}$$

we see that $\kappa(x, x')$ can be written as the inner product in $\mathbb{R}^6$ of the two feature vectors
$\phi(x)$ and $\phi(x')$, where the feature map $\phi : \mathbb{R}^2 \to \mathbb{R}^6$ can be explicitly identified as

$$\phi(x) = [1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ \sqrt{2}\, x_1 x_2,\ x_1^2,\ x_2^2]^\top.$$

Thus, the RKHS determined by $\kappa$ can be explicitly identified with the space of functions
$x \mapsto \phi(x)^\top \beta$ for some $\beta \in \mathbb{R}^6$. ■
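
The identity between the polynomial kernel and the explicit feature map of Example 6.10 is easily verified numerically. In the minimal sketch below the random test points are an assumption; the kernel is evaluated once directly and once through the six-dimensional feature vectors.

```python
import numpy as np

def kappa(x, xp):
    """Polynomial kernel (1 + <x, x'>)^2 on R^2."""
    return (1.0 + x @ xp) ** 2

def phi(x):
    """Explicit feature map R^2 -> R^6 from Example 6.10."""
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x[0], r2 * x[1], r2 * x[0] * x[1], x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(6)
x, xp = rng.normal(size=2), rng.normal(size=2)   # hypothetical test points
print(kappa(x, xp), phi(x) @ phi(xp))            # equal up to rounding
```

Evaluating the kernel directly requires one inner product in $\mathbb{R}^2$, whereas the feature-map route requires constructing two vectors in $\mathbb{R}^6$; this gap widens quickly with the dimension and the degree of the polynomial.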


In the above example we could explicitly identify the feature map. However, in general 
a feature map need not be explicitly available. Using a particular reproducing kernel cor- 
responds to using an implicit (possibly infinite dimensional!) feature map that never needs 
to be explicitly computed. 


6.5 Representer Theorem 


Recall the setting discussed at the beginning of this chapter: we are given training data
$\tau = \{(x_i, y_i)\}_{i=1}^n$ and a loss function that measures the fit to the data, and we wish to find
a function $g$ that minimizes the training loss, with the addition of a regularization term,
as described in Section 6.2. To do this, we assume first that the class $\mathcal{G}$ of prediction
functions can be decomposed as the direct sum of an RKHS $\mathcal{H}$, defined by a kernel function
$\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and another linear space of real-valued functions $\mathcal{H}_0$ on $\mathcal{X}$; that is,

$$\mathcal{G} = \mathcal{H} \oplus \mathcal{H}_0,$$

meaning that any element $g \in \mathcal{G}$ can be written as $g = h + h_0$, with $h \in \mathcal{H}$ and $h_0 \in \mathcal{H}_0$.
In minimizing the training loss we wish to penalize the $h$ term of $g$ but not the $h_0$ term.
Specifically, the aim is to solve the functional optimization problem

$$\min_{g \in \mathcal{H} \oplus \mathcal{H}_0} \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(y_i, g(x_i)) + \gamma\, \|g\|_{\mathcal{H}}^2. \qquad (6.19)$$

Here, we use a slight abuse of notation: $\|g\|_{\mathcal{H}}$ means $\|h\|_{\mathcal{H}}$ if $g = h + h_0$, as above. In this
way, we can view $\mathcal{H}_0$ as the null space of the functional $g \mapsto \|g\|_{\mathcal{H}}$. This null space may be
empty, but typically has a small dimension $m$; for example it could be the one-dimensional
space of constant functions, as in Example 6.2.

Example 6.11 (Null Space) Consider again the setting of Example 6.2, for which we
have feature vectors $\tilde{x} = [1, x^\top]^\top$ and $\mathcal{G}$ consists of functions of the form $g : \tilde{x} \mapsto \beta_0 + x^\top \beta$.
Each function $g$ can be decomposed as $g = h + h_0$, where $h : \tilde{x} \mapsto x^\top \beta$, and $h_0 : \tilde{x} \mapsto \beta_0$.
Given $g \in \mathcal{G}$, we have $\|g\|_{\mathcal{H}} = \|\beta\|$, and so the null space $\mathcal{H}_0$ of the functional $g \mapsto \|g\|_{\mathcal{H}}$
(that is, the set of all functions $g \in \mathcal{G}$ for which $\|g\|_{\mathcal{H}} = 0$) is the set of constant functions
here, which has dimension $m = 1$. ■


Regularization favors elements in $\mathcal{H}_0$ and penalizes large elements in $\mathcal{H}$. As the
regularization parameter $\gamma$ varies between zero and infinity, solutions to (6.19) vary from
“complex” ($g \in \mathcal{H} \oplus \mathcal{H}_0$) to “simple” ($g \in \mathcal{H}_0$).

A key reason why RKHSs are so useful is the following. By choosing H to be an 
RKHS in (6.19) this functional optimization problem effectively becomes a parametric 
optimization problem. The reason is that any solution to (6.19) can be represented as a 
finite-dimensional linear combination of kernel functions, evaluated at the training sample. 
This is known as the kernel trick. 


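Theorem 6.6: Representer Theorem

The solution to the penalized optimization problem (6.19) is of the form

$$g(x) = \sum_{i=1}^n \alpha_i\, \kappa(x_i, x) + \sum_{j=1}^m \eta_j\, q_j(x), \qquad (6.20)$$

where $\{q_1, \ldots, q_m\}$ is a basis of $\mathcal{H}_0$.
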
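Proof: Let $\mathcal{F} = \mathrm{Span}\{\kappa_{x_i}, i = 1, \ldots, n\}$. Clearly, $\mathcal{F} \subseteq \mathcal{H}$. Then, the Hilbert space $\mathcal{H}$ can
be represented as $\mathcal{H} = \mathcal{F} \oplus \mathcal{F}^\perp$, where $\mathcal{F}^\perp$ is the orthogonal complement of $\mathcal{F}$. In other
words, $\mathcal{F}^\perp$ is the class of functions

$$\{f^\perp \in \mathcal{H} : \langle f^\perp, f \rangle_{\mathcal{H}} = 0,\ f \in \mathcal{F}\} \equiv \{f^\perp : \langle f^\perp, \kappa_{x_i} \rangle_{\mathcal{H}} = 0,\ \forall i\}.$$

It follows, by the reproducing kernel property, that for all $f^\perp \in \mathcal{F}^\perp$:

$$f^\perp(x_i) = \langle f^\perp, \kappa_{x_i} \rangle_{\mathcal{H}} = 0, \quad i = 1, \ldots, n.$$

Now, take any $g \in \mathcal{H} \oplus \mathcal{H}_0$, and write it as $g = f + f^\perp + h_0$, with $f \in \mathcal{F}$, $f^\perp \in \mathcal{F}^\perp$, and
$h_0 \in \mathcal{H}_0$. By the definition of the null space $\mathcal{H}_0$, we have $\|g\|_{\mathcal{H}}^2 = \|f + f^\perp\|_{\mathcal{H}}^2$. Moreover, by
Pythagoras’ theorem, the latter is equal to $\|f\|_{\mathcal{H}}^2 + \|f^\perp\|_{\mathcal{H}}^2$. It follows that

$$
\begin{aligned}
\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(y_i, g(x_i)) + \gamma \|g\|_{\mathcal{H}}^2
&= \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(y_i, f(x_i) + h_0(x_i)) + \gamma \left(\|f\|_{\mathcal{H}}^2 + \|f^\perp\|_{\mathcal{H}}^2\right) \\
&\geq \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(y_i, f(x_i) + h_0(x_i)) + \gamma \|f\|_{\mathcal{H}}^2.
\end{aligned}
$$

Since we can obtain equality by taking $f^\perp = 0$, this implies that the minimizer of the
penalized optimization problem (6.19) lies in the subspace $\mathcal{F} \oplus \mathcal{H}_0$ of $\mathcal{G} = \mathcal{H} \oplus \mathcal{H}_0$, and hence
is of the form (6.20). □
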
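Substituting the representation (6.20) of $g$ into (6.19) gives the finite-dimensional
optimization problem:

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^n,\ \boldsymbol{\eta} \in \mathbb{R}^m} \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(y_i, (\mathbf{K}\boldsymbol{\alpha} + \mathbf{Q}\boldsymbol{\eta})_i) + \gamma\, \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}, \qquad (6.21)$$

where
• $\mathbf{K}$ is the $n \times n$ (Gram) matrix with entries $[\kappa(x_i, x_j),\ i = 1, \ldots, n,\ j = 1, \ldots, n]$.
• $\mathbf{Q}$ is the $n \times m$ matrix with entries $[q_j(x_i),\ i = 1, \ldots, n,\ j = 1, \ldots, m]$.

In particular, for the squared-error loss we have

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^n,\ \boldsymbol{\eta} \in \mathbb{R}^m} \frac{1}{n} \left\| \mathbf{y} - (\mathbf{K}\boldsymbol{\alpha} + \mathbf{Q}\boldsymbol{\eta}) \right\|^2 + \gamma\, \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}. \qquad (6.22)$$

This is a convex optimization problem, and its solution is found by differentiating (6.22)
with respect to $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$ and equating to zero, leading to the following system of $(n + m)$
linear equations:

$$
\begin{bmatrix} \mathbf{K}\mathbf{K}^\top + n\gamma \mathbf{K} & \mathbf{K}\mathbf{Q} \\ \mathbf{Q}^\top \mathbf{K}^\top & \mathbf{Q}^\top \mathbf{Q} \end{bmatrix}
\begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\eta} \end{bmatrix}
=
\begin{bmatrix} \mathbf{K}^\top \\ \mathbf{Q}^\top \end{bmatrix} \mathbf{y}. \qquad (6.23)
$$

As long as $\mathbf{Q}$ is of full column rank, the minimizing function is unique.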


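Example 6.12 (Ridge Regression (cont.)) We return to Example 6.2 and identify that
$\mathcal{H}$ is the RKHS with linear kernel function $\kappa(x, x') = x^\top x'$ and $\mathcal{C} = \mathcal{H}_0$ is the linear space of
constant functions. In this case, $\mathcal{H}_0$ is spanned by the function $q_1 \equiv 1$. Moreover, $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$
and $\mathbf{Q} = \mathbf{1}$.

If we appeal to the representer theorem directly, then the problem in (6.6) becomes, as
a result of (6.21):

$$\min_{\boldsymbol{\alpha},\ \eta_0} \frac{1}{n} \left\| \mathbf{y} - \eta_0 \mathbf{1} - \mathbf{X}\mathbf{X}^\top \boldsymbol{\alpha} \right\|^2 + \gamma \left\| \mathbf{X}^\top \boldsymbol{\alpha} \right\|^2.$$

This is a convex optimization problem, and so the solution follows by taking derivatives
and setting them to zero. This gives the equations

$$\mathbf{X}\mathbf{X}^\top \left( (\mathbf{X}\mathbf{X}^\top + n\gamma \mathbf{I}_n)\boldsymbol{\alpha} + \eta_0 \mathbf{1} - \mathbf{y} \right) = \mathbf{0},$$

and

$$n\, \eta_0 = \mathbf{1}^\top (\mathbf{y} - \mathbf{X}\mathbf{X}^\top \boldsymbol{\alpha}).$$

Note that these are equivalent to (6.8) and (6.9) (once again assuming that $n \geq p$ and $\mathbf{X}$ has
full rank $p$). Equivalently, the solution is found by solving (6.23):

$$
\begin{bmatrix} \mathbf{X}\mathbf{X}^\top \mathbf{X}\mathbf{X}^\top + n\gamma \mathbf{X}\mathbf{X}^\top & \mathbf{X}\mathbf{X}^\top \mathbf{1} \\ \mathbf{1}^\top \mathbf{X}\mathbf{X}^\top & n \end{bmatrix}
\begin{bmatrix} \boldsymbol{\alpha} \\ \eta_0 \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}\mathbf{X}^\top \\ \mathbf{1}^\top \end{bmatrix} \mathbf{y}.
$$

This is a system of $(n + 1)$ linear equations, and is typically of much larger dimension than
the $(p + 1)$ linear equations given by (6.8) and (6.9). As such, one may question the
practicality of reformulating the problem in this way. However, the benefit of this formulation
is that the problem can be expressed entirely through the Gram matrix $\mathbf{K}$, without having
to explicitly compute the feature vectors, in turn permitting the (implicit) use of infinite
dimensional feature spaces. ■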
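
The equivalence claimed in Example 6.12 can be checked numerically on simulated data. The sketch below is illustrative only: the helper name `solve_representer`, the toy data, and the use of `numpy.linalg.lstsq` are assumptions made here. A least-squares solve is used because the matrix in (6.23) is singular when $\mathbf{K} = \mathbf{X}\mathbf{X}^\top$ has rank $p < n$; the coefficients $\boldsymbol{\alpha}$ are then not unique, but the fitted function is.

```python
import numpy as np

def solve_representer(K, Q, y, gamma):
    """Solve the (n + m) linear system (6.23) for alpha and eta."""
    n = K.shape[0]
    top = np.hstack((K @ K.T + n * gamma * K, K @ Q))
    bottom = np.hstack((Q.T @ K.T, Q.T @ Q))
    M = np.vstack((top, bottom))
    c = np.vstack((K.T, Q.T)) @ y
    # lstsq returns a solution even when M is singular (rank-deficient K)
    sol = np.linalg.lstsq(M, c, rcond=None)[0]
    return sol[:n], sol[n:]

# Toy data (hypothetical): linear kernel, constant null space
rng = np.random.default_rng(1)
n, p, gamma = 30, 3, 0.1
X = rng.normal(size=(n, p))
y = 1.5 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
K = X @ X.T
Q = np.ones((n, 1))
alpha, eta = solve_representer(K, Q, y, gamma)
fitted = K @ alpha + Q @ eta

# Ridge regression with an unpenalized intercept, solved in p dimensions
Xc = X - X.mean(axis=0)
beta = np.linalg.solve(Xc.T @ Xc + n * gamma * np.eye(p), Xc.T @ (y - y.mean()))
beta0 = y.mean() - X.mean(axis=0) @ beta
print(np.allclose(fitted, beta0 + X @ beta))
```

Both routes give the same fitted values; the kernel route only ever touches the data through the Gram matrix `K`.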
Example 6.13 (Estimating the Peaks Function) Figure 6.4 shows the surface plot of
the peaks function:

$$f(x_1, x_2) = 3(1 - x_1)^2 e^{-x_1^2 - (x_2 + 1)^2} - 10\left(\frac{x_1}{5} - x_1^3 - x_2^5\right) e^{-x_1^2 - x_2^2} - \frac{1}{3} e^{-(x_1 + 1)^2 - x_2^2}. \qquad (6.24)$$

The goal is to learn the function $y = f(x)$ based on a small set of training data (pairs of
$(x, y)$ values). The red dots in the figure represent data $\tau = \{(x_i, y_i)\}_{i=1}^{20}$, where $y_i = f(x_i)$ and
the $\{x_i\}$ have been chosen in a quasi-random way, using Hammersley points (with bases 2
and 3) on the square $[-3, 3]^2$. Quasi-random point sets have better space-filling properties
than either a regular grid of points or a set of pseudo-random points. We refer to [71] for
details. Note that there is no observation noise in this particular problem.

Figure 6.4: Peaks function sampled at 20 Hammersley points.

The purpose of this example is to illustrate how, using the small data set of size $n = 20$,
the entire peaks function can be approximated well using kernel methods. In particular, we
use the Gaussian kernel (6.15) on $\mathbb{R}^2$, and denote by $\mathcal{H}$ the unique RKHS corresponding
to this kernel. We omit the regularization term in (6.19), and thus our objective is to find
the solution to

$$\min_{g \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2.$$

By the representer theorem, the optimal function is of the form

$$g(x) = \sum_{i=1}^n \alpha_i \exp\left(-\frac{1}{2} \frac{\|x - x_i\|^2}{\sigma^2}\right),$$

where $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_n]^\top$ is, by (6.23), the solution to the set of linear equations
$\mathbf{K}\mathbf{K}^\top \boldsymbol{\alpha} = \mathbf{K}\mathbf{y}$.

Note that we are performing regression over the class of functions $\mathcal{H}$ with an implicit
feature space. Due to the representer theorem, the solution to this problem coincides with
the solution to the linear regression problem for which the $i$-th feature (for $i = 1, \ldots, n$) is
chosen to be the vector $[\kappa(x_1, x_i), \ldots, \kappa(x_n, x_i)]^\top$.

The following code performs these calculations and gives the contour plots of $g$ and
the peaks functions, shown in Figure 6.5. We see that the two are quite close. Code for the
generation of Hammersley points is available from the book’s GitHub site as genham.py.
peakskernel.py

from genham import hammersley
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from numpy.linalg import norm

def peaks(x, y):
    # peaks function (6.24)
    z = (3*(1-x)**2 * np.exp(-(x**2) - (y+1)**2)
         - 10*(x/5 - x**3 - y**5) * np.exp(-x**2 - y**2)
         - 1/3 * np.exp(-(x+1)**2 - y**2))
    return z

n = 20
x = -3 + 6*hammersley([2,3], n)      # training inputs on [-3,3]^2
z = peaks(x[:,0], x[:,1])            # noise-free responses
xx, yy = np.mgrid[-3:3:150j, -3:3:150j]
zz = peaks(xx, yy)
plt.contour(xx, yy, zz, levels=50)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, color='c', alpha=0.3,
                linewidth=0)
ax.scatter(x[:,0], x[:,1], z, color='k', s=20)
plt.show()

sig2 = 0.3  # kernel parameter
def k(x, u):
    return np.exp(-0.5*norm(x - u)**2/sig2)

K = np.zeros((n, n))                 # Gram matrix
for i in range(n):
    for j in range(n):
        K[i,j] = k(x[i,:], x[j,:])
alpha = np.linalg.solve(K @ K.T, K @ z)

N, = xx.flatten().shape
Kx = np.zeros((n, N))                # kernel between training and grid points
for i in range(n):
    for j in range(N):
        Kx[i,j] = k(x[i,:], np.array([xx.flatten()[j], yy.flatten()[j]]))

g = Kx.T @ alpha
dim = np.sqrt(N).astype(int)
yhat = g.reshape(dim, dim)
plt.contour(xx, yy, yhat, levels=50)
plt.show()

Figure 6.5: Contour plots for the prediction function g (left) and the peaks function given
in (6.24) (right).


6.6 Smoothing Cubic Splines 


A striking application of kernel methods is to fitting “well-behaved” functions to data.
Key examples of “well-behaved” functions are those that do not have large second-order
derivatives. Consider functions $g : [0, 1] \to \mathbb{R}$ that are twice differentiable and define
$\|g''\|^2 := \int_0^1 (g''(x))^2\, dx$ as a measure of the size of the second derivative.


Example 6.14 (Behavior of $\|g''\|^2$) Intuitively, the larger $\|g''\|^2$ is, the more “wiggly”
the function $g$ will be. As an explicit example, consider $g(x) = \sin(\omega x)$ for $x \in [0, 1]$, where
$\omega$ is a free parameter. We can explicitly compute $g''(x) = -\omega^2 \sin(\omega x)$, and consequently

$$\|g''\|^2 = \int_0^1 \omega^4 \sin^2(\omega x)\, dx = \frac{\omega^4}{2} \left(1 - \mathrm{sinc}(2\omega)\right).$$

As $|\omega| \to \infty$, the frequency of $g$ increases and we have $\|g''\|^2 \to \infty$. ■
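
The closed form in Example 6.14 can be compared with a direct numerical integration. In this sketch sinc is taken as the unnormalized $\mathrm{sinc}(t) = \sin(t)/t$, which is what the integral evaluates to; NumPy's `np.sinc` is the normalized version, hence the rescaling by $\pi$. The function names are illustrative.

```python
import numpy as np

def curvature_closed_form(w):
    # np.sinc(t) = sin(pi t)/(pi t), so np.sinc(2w/pi) = sin(2w)/(2w)
    return 0.5 * w**4 * (1.0 - np.sinc(2.0 * w / np.pi))

def curvature_numeric(w, num=200001):
    # trapezoidal rule for the integral of (g''(x))^2 over [0, 1]
    x = np.linspace(0.0, 1.0, num)
    f = (w**2 * np.sin(w * x)) ** 2
    dx = x[1] - x[0]
    return dx * (f.sum() - 0.5 * (f[0] + f[-1]))

for w in (1.0, 5.0, 20.0):
    print(w, curvature_closed_form(w), curvature_numeric(w))
```

The printed values grow roughly like $\omega^4/2$, in line with the statement that higher-frequency functions incur a larger curvature penalty.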

Now, in the context of data fitting, consider the following penalized least-squares
optimization problem on $[0, 1]$:

$$\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n (y_i - g(x_i))^2 + \gamma\, \|g''\|^2, \qquad (6.25)$$

where we will specify $\mathcal{G}$ in what follows. In order to apply the kernel machinery, we want
to write this in the form (6.19), for some RKHS $\mathcal{H}$ and null space $\mathcal{H}_0$. Clearly, the norm on
$\mathcal{H}$ should be of the form $\|g\|_{\mathcal{H}} = \|g''\|$ and should be well-defined (i.e., finite and ensuring
$g$ and $g'$ are absolutely continuous). This suggests that we take

$$\mathcal{H} = \{g \in L^2[0, 1] : \|g''\| < \infty,\ g, g' \text{ absolutely continuous},\ g(0) = g'(0) = 0\},$$

with inner product

$$\langle f, g \rangle_{\mathcal{H}} := \int_0^1 f''(x)\, g''(x)\, dx.$$

One rationale for imposing the boundary conditions $g(0) = g'(0) = 0$ is as follows: when
expanding $g$ about the point $x = 0$, Taylor’s theorem (with integral remainder term) states
that

$$g(x) = g(0) + g'(0)\, x + \int_0^x g''(s)\, (x - s)\, ds.$$

Imposing the condition that $g(0) = g'(0) = 0$ for functions in $\mathcal{H}$ will ensure that
$\mathcal{G} = \mathcal{H} \oplus \mathcal{H}_0$ where the null space $\mathcal{H}_0$ contains only linear functions, as we will see.

To see that this $\mathcal{H}$ is in fact an RKHS, we derive its reproducing kernel. Using integration
by parts (or directly from the Taylor expansion above), write

$$g(x) = \int_0^x g'(s)\, ds = \int_0^x g''(s)\, (x - s)\, ds = \int_0^1 g''(s)\, (x - s)_+\, ds.$$

If $\kappa$ is a kernel, then by the reproducing property it must hold that

$$g(x) = \langle g, \kappa_x \rangle_{\mathcal{H}} = \int_0^1 g''(s)\, \kappa_x''(s)\, ds,$$

so that $\kappa$ must satisfy $\frac{\partial^2}{\partial s^2} \kappa(x, s) = (x - s)_+$, where $y_+ := \max\{y, 0\}$. Therefore, noting that
$\kappa(x, u) = \langle \kappa_x, \kappa_u \rangle_{\mathcal{H}}$, we have (see Exercise 15)

$$\kappa(x, u) = \int_0^1 \frac{\partial^2 \kappa(x, s)}{\partial s^2} \frac{\partial^2 \kappa(u, s)}{\partial s^2}\, ds = \frac{\max\{x, u\} \min\{x, u\}^2}{2} - \frac{\min\{x, u\}^3}{6}.$$

The last expression is a cubic function with quadratic and cubic terms that misses the
constant and linear monomials. This is not surprising considering the Taylor’s theorem
interpretation of a function $g \in \mathcal{H}$. If we now take $\mathcal{H}_0$ as the space of functions of the
following form (having zero second derivative):

$$h_0 = \eta_1 + \eta_2\, x, \quad x \in [0, 1],$$

then (6.25) is exactly of the form (6.19).
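
As a sanity check on the kernel just derived, the closed form can be compared with a numerical evaluation of $\int_0^1 (x - s)_+ (u - s)_+\, ds$. This is a minimal sketch; the function names and the grid size are choices made here for illustration.

```python
import numpy as np

def spline_kernel(x, u):
    """Closed-form kernel: max*min^2/2 - min^3/6."""
    lo, hi = min(x, u), max(x, u)
    return 0.5 * hi * lo**2 - lo**3 / 6.0

def spline_kernel_numeric(x, u, num=200001):
    """Trapezoidal approximation of the integral of (x-s)_+ (u-s)_+ over [0, 1]."""
    s = np.linspace(0.0, 1.0, num)
    f = np.maximum(x - s, 0.0) * np.maximum(u - s, 0.0)
    ds = s[1] - s[0]
    return ds * (f.sum() - 0.5 * (f[0] + f[-1]))

for x, u in [(0.3, 0.7), (0.9, 0.2), (0.5, 0.5)]:
    print(x, u, spline_kernel(x, u), spline_kernel_numeric(x, u))
```

For instance, at $(x, u) = (0.3, 0.7)$ the closed form gives $0.7 \cdot 0.09/2 - 0.027/6 = 0.027$.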
As a consequence of the representer Theorem 6.6, the optimal solution to (6.25) is a
linear combination of piecewise cubic functions:

$$g(x) = \eta_1 + \eta_2\, x + \sum_{i=1}^n \alpha_i\, \kappa(x_i, x). \qquad (6.26)$$

Such a function is called a cubic spline with $n$ knots (with one knot at each data point $x_i$),
so called because the piecewise cubic function between knots is required to be “tied
together” at the knots. The parameters $\boldsymbol{\alpha}, \boldsymbol{\eta}$ are determined from (6.21), for instance by
solving (6.23) with matrices $\mathbf{K} = [\kappa(x_i, x_j)]_{i,j=1}^n$ and $\mathbf{Q}$ with $i$-th row of the form $[1, x_i]$ for
$i = 1, \ldots, n$.


Example 6.15 (Smoothing Spline) Figure 6.6 shows various cubic smoothing splines
for the data (0.05, 0.4), (0.2, 0.2), (0.5, 0.6), (0.75, 0.7), (1, 1). In the figure, we use the
reparameterization $r = 1/(1 + n\gamma)$ for the smoothing parameter. Thus $r \in [0, 1]$, where $r = 0$
means an infinite penalty for curvature (leading to the ordinary linear regression solution)
and $r = 1$ does not penalize curvature at all and leads to a perfect fit via the so-called
natural spline. Of course the latter will generally lead to overfitting. For $r$ from 0 up to 0.8 the
solutions will be close to the simple linear regression line, while only for $r$ very close to 1,
the shape of the curve changes significantly.

Figure 6.6: Various cubic smoothing splines for smoothing parameter $r = 1/(1 + n\gamma) \in$
{0.8, 0.99, 0.999, 0.999999}. For $r = 1$, the natural spline through the data points is
obtained; for $r = 0$, the simple linear regression line is found.


The following code first computes the matrices $\mathbf{K}$ and $\mathbf{Q}$, and then solves the linear
system (6.23). Finally, the smoothing curve is determined via (6.26), for selected points,
and then plotted. Note that the code plots only a single curve corresponding to the specified
value of $r$.





smoothspline.py

import matplotlib.pyplot as plt
import numpy as np

x = np.array([[0.05, 0.2, 0.5, 0.75, 1.]]).T
y = np.array([[0.4, 0.2, 0.6, 0.7, 1.]]).T
n = x.shape[0]
r = 0.999
ngamma = (1-r)/r   # n*gamma, from r = 1/(1 + n*gamma)

# cubic spline kernel of Section 6.6
k = lambda x1, x2 : (1/2)* np.max((x1,x2)) * np.min((x1,x2)) ** 2 \
                    - ((1/6)* np.min((x1,x2))**3)
K = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        K[i,j] = k(x[i], x[j])

Q = np.hstack((np.ones((n,1)), x))

# assemble and solve the linear system (6.23)
m1 = np.hstack((K @ K.T + (ngamma * K), K @ Q))
m2 = np.hstack((Q.T @ K.T, Q.T @ Q))
M = np.vstack((m1,m2))
c = np.vstack((K, Q.T)) @ y
ad = np.linalg.solve(M,c)

# plot the curve
xx = np.arange(0,1+0.01,0.01).reshape(-1,1)
Qx = np.hstack((np.ones_like(xx), xx))
g = np.zeros_like(xx)
N = np.shape(xx)[0]
Kx = np.zeros((n,N))
for i in range(n):
    for j in range(N):
        Kx[i,j] = k(x[i], xx[j])
g = g + np.hstack((Kx.T, Qx)) @ ad
plt.ylim((0,1.15))
plt.plot(xx, g, label = 'r = {}'.format(r), linewidth = 2)
plt.plot(x,y, 'b.', markersize=15)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show()


6.7 Gaussian Process Regression


Another application of the kernel machinery is to Gaussian process regression. A Gaussian
process (GP) on a space $\mathcal{X}$ is a stochastic process $\{Z_x, x \in \mathcal{X}\}$ where, for any choice of
indices $x_1, \ldots, x_n$, the vector $[Z_{x_1}, \ldots, Z_{x_n}]^\top$ has a multivariate Gaussian distribution. As
such, the distribution of a GP is completely specified by its mean and covariance functions
$\mu : \mathcal{X} \to \mathbb{R}$ and $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, respectively. The covariance function is a finite positive
semidefinite function, and hence, in view of Theorem 6.2, can be viewed as a reproducing
kernel on $\mathcal{X}$.

As for ordinary regression, the objective of GP regression is to learn a regression function
$g$ that predicts a response $y = g(x)$ for each feature vector $x$. This is done in a Bayesian
fashion, by establishing (1) a prior pdf for $g$ and (2) the likelihood of the data, for a given
$g$. From these two we then derive, via Bayes’ formula, the posterior distribution of $g$ given
the data. We refer to Section 2.9 for the general Bayesian framework.

A simple Bayesian model for GP regression is as follows. First, the prior distribution of
$g$ is taken to be the distribution of a GP with some known mean function $\mu$ and covariance
function (that is, kernel) $\kappa$. Most often $\mu$ is taken to be a constant, and for simplicity of
exposition, we take it to be 0. The Gaussian kernel (6.15) is often used for the covariance
function. For radial basis function kernels (including the Gaussian kernel), points that are
closer will be more highly correlated or “similar” [97], independent of translations in space.
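Second, similar to standard regression, we view the observed feature vectors $x_1, \ldots, x_n$
as fixed and the responses $y_1, \ldots, y_n$ as outcomes of random variables $Y_1, \ldots, Y_n$.
Specifically, given $g$, we model the $\{Y_i\}$ as

$$Y_i = g(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (6.27)$$

where $\{\varepsilon_i\} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. To simplify the analysis, let us assume that $\sigma^2$ is known, so no prior
needs to be specified for $\sigma^2$. Let $\mathbf{g} = [g(x_1), \ldots, g(x_n)]^\top$ be the (unknown) vector of
regression values. Placing a GP prior on the function $g$ is equivalent to placing a multivariate
Gaussian prior on the vector $\mathbf{g}$:

$$\mathbf{g} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), \qquad (6.28)$$

where the covariance matrix $\mathbf{K}$ of $\mathbf{g}$ is a Gram matrix (implicitly associated with a feature
map through the kernel $\kappa$), given by:

$$
\mathbf{K} = \begin{bmatrix}
\kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\kappa(x_n, x_1) & \kappa(x_n, x_2) & \cdots & \kappa(x_n, x_n)
\end{bmatrix}. \qquad (6.29)
$$

The likelihood of our data given $\mathbf{g}$, denoted $p(\mathbf{y} \mid \mathbf{g})$, is obtained directly from the model
(6.27):

$$(\mathbf{Y} \mid \mathbf{g}) \sim \mathcal{N}(\mathbf{g}, \sigma^2 \mathbf{I}_n). \qquad (6.30)$$

Solving this Bayesian problem involves deriving the posterior distribution of $(\mathbf{g} \mid \mathbf{Y})$. To
do so, we first note that since $\mathbf{Y}$ has covariance matrix $\mathbf{K} + \sigma^2 \mathbf{I}_n$ (which can be seen from
(6.27)), the joint distribution of $\mathbf{Y}$ and $\mathbf{g}$ is again normal, with mean $\mathbf{0}$ and covariance
matrix:

$$
\mathbf{K}_{\mathbf{y}, \mathbf{g}} = \begin{bmatrix} \mathbf{K} + \sigma^2 \mathbf{I}_n & \mathbf{K} \\ \mathbf{K}^\top & \mathbf{K} \end{bmatrix}. \qquad (6.31)
$$

The posterior can then be found by conditioning on $\mathbf{Y} = \mathbf{y}$, via Theorem C.8, giving

$$(\mathbf{g} \mid \mathbf{y}) \sim \mathcal{N}\left(\mathbf{K}^\top (\mathbf{K} + \sigma^2 \mathbf{I}_n)^{-1} \mathbf{y},\ \mathbf{K} - \mathbf{K}^\top (\mathbf{K} + \sigma^2 \mathbf{I}_n)^{-1} \mathbf{K}\right).$$

This only gives information about $g$ at the observed points $x_1, \ldots, x_n$. It is more interesting
to consider the posterior predictive distribution of $\tilde{g} := g(\tilde{x})$ for a new input $\tilde{x}$. We can find
the corresponding posterior predictive pdf $p(\tilde{g} \mid \mathbf{y})$ by integrating out the joint posterior pdf
$p(\tilde{g}, \mathbf{g} \mid \mathbf{y})$, which is equivalent to taking the expectation of $p(\tilde{g} \mid \mathbf{g})$ when $\mathbf{g}$ is distributed
according to the posterior pdf $p(\mathbf{g} \mid \mathbf{y})$; that is,

$$p(\tilde{g} \mid \mathbf{y}) = \int p(\tilde{g} \mid \mathbf{g})\, p(\mathbf{g} \mid \mathbf{y})\, d\mathbf{g}.$$

To do so more easily than direct evaluation via the above integral representation of $p(\tilde{g} \mid \mathbf{y})$,
we can begin with the joint distribution of $[\mathbf{y}^\top, \tilde{g}]^\top$, which is multivariate normal with mean
$\mathbf{0}$ and covariance matrix

$$
\begin{bmatrix} \mathbf{K} + \sigma^2 \mathbf{I}_n & \boldsymbol{\kappa} \\ \boldsymbol{\kappa}^\top & \kappa(\tilde{x}, \tilde{x}) \end{bmatrix}, \qquad (6.32)
$$

where $\boldsymbol{\kappa} = [\kappa(\tilde{x}, x_1), \ldots, \kappa(\tilde{x}, x_n)]^\top$. It now follows, again by using Theorem C.8, that $(\tilde{g} \mid \mathbf{y})$
has a normal distribution with mean and variance given respectively by

$$\mu(\tilde{x}) = \boldsymbol{\kappa}^\top (\mathbf{K} + \sigma^2 \mathbf{I}_n)^{-1} \mathbf{y} \qquad (6.33)$$

and

$$\sigma^2(\tilde{x}) = \kappa(\tilde{x}, \tilde{x}) - \boldsymbol{\kappa}^\top (\mathbf{K} + \sigma^2 \mathbf{I}_n)^{-1} \boldsymbol{\kappa}. \qquad (6.34)$$

These are sometimes called the predictive mean and variance. It is important to note that
we are predicting the expected response $\mathbb{E}\tilde{Y} = g(\tilde{x})$ here, and not the actual response $\tilde{Y}$.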
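
The predictive formulas (6.33) and (6.34) translate directly into a few lines of linear algebra. The sketch below makes several assumptions that are not fixed by the text: scalar inputs, a Gaussian kernel of the form $\exp(-\tfrac{1}{2}(x - x')^2 / b^2)$ with bandwidth $b$, a Cholesky factorization in place of an explicit inverse, and simulated data; the function names are illustrative.

```python
import numpy as np

def gauss_kernel(a, b, bandwidth):
    """Gaussian kernel matrix for 1-D inputs a (n,) and b (m,)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / bandwidth**2)

def gp_predict(x, y, xnew, bandwidth, sigma):
    """Predictive mean (6.33) and variance (6.34) at the points xnew."""
    n = len(x)
    Ky = gauss_kernel(x, x, bandwidth) + sigma**2 * np.eye(n)
    kappa = gauss_kernel(x, xnew, bandwidth)   # column j: kappa vector for xnew[j]
    L = np.linalg.cholesky(Ky)                 # Ky = L L^T
    A = np.linalg.solve(L, kappa)              # L^{-1} kappa
    mean = A.T @ np.linalg.solve(L, y)         # kappa^T Ky^{-1} y
    var = 1.0 - np.sum(A**2, axis=0)           # kappa(x, x) = 1 for this kernel
    return mean, var

# Simulated data (hypothetical), loosely following the setup of a noisy sinusoid
rng = np.random.default_rng(0)
x = rng.uniform(size=30)
y = 2 * np.sin(2 * np.pi * x) + 0.5 * rng.normal(size=30)
xnew = np.linspace(0, 1, 5)
mean, var = gp_predict(x, y, xnew, bandwidth=0.2, sigma=0.5)
print(mean)
print(var)
```

The variance returned here is that of the latent value $g(\tilde{x})$; a predictive variance for a new noisy response would add $\sigma^2$.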


Example 6.16 (GP Regression) Suppose the regression function is

$$g(x) = 2 \sin(2\pi x), \quad x \in [0, 1].$$

We use GP regression to estimate $g$, using a Gaussian kernel of the form (6.15) with bandwidth
parameter 0.2. The explanatory variables $x_1, \ldots, x_{30}$ were drawn uniformly on the
interval $[0, 1]$, and the responses were obtained from (6.27), with noise level $\sigma = 0.5$.
Figure 6.7 shows 10 samples from the prior distribution for $g$ as well as the data points and
the true sinusoidal regression function $g$.

Figure 6.7: Left: samples drawn from the GP prior distribution. Right: the true regression
function with the data points.


Again assuming that the variance $\sigma^2$ is known, the predictive distribution as determined
by (6.33) and (6.34) is shown in Figure 6.8 for bandwidth 0.2 (left) and 0.02 (right).
Clearly, decreasing the bandwidth leads to the covariance between points $x$ and $x'$ decreasing
at a faster rate with respect to the squared distance $\|x - x'\|^2$, leading to a predictive
mean that is less smooth. ■


In the above exposition, we have taken the mean function for the prior distribution
of $g$ to be identically zero. If instead we have a general mean function $m$ and write
$\mathbf{m} = [m(x_1), \ldots, m(x_n)]^\top$ then the predictive variance (6.34) remains unchanged, and the
predictive mean (6.33) is modified to read

$$\mu(\tilde{x}) = m(\tilde{x}) + \boldsymbol{\kappa}^\top (\mathbf{K} + \sigma^2 \mathbf{I}_n)^{-1} (\mathbf{y} - \mathbf{m}). \qquad (6.35)$$

Figure 6.8: GP regression of synthetic data set with bandwidth 0.2 (left) and 0.02 (right).
The black dots represent the data and the blue curve is the latent function $g(x) = 2 \sin(2\pi x)$.
The red curve is the mean of the GP predictive distribution given by (6.33), and the shaded
region is the 95% confidence band, corresponding to the predictive variance given in (6.34).


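Typically, the variance $\sigma^2$ appearing in (6.27) is not known, and the kernel $\kappa$ itself
depends on several parameters, for instance a Gaussian kernel (6.15) with an unknown
bandwidth parameter. In the Bayesian framework, one typically specifies a hierarchical
model by introducing a prior $p(\boldsymbol{\theta})$ for the vector $\boldsymbol{\theta}$ of such hyperparameters. Now, the
GP prior $(g \mid \boldsymbol{\theta})$ (equivalently, specifying $p(\mathbf{g} \mid \boldsymbol{\theta})$) and the model for the likelihood of the
data given $\mathbf{Y} \mid \mathbf{g}, \boldsymbol{\theta}$, namely $p(\mathbf{y} \mid \mathbf{g}, \boldsymbol{\theta})$, are both dependent on $\boldsymbol{\theta}$. The posterior distribution of
$(\mathbf{g} \mid \mathbf{y}, \boldsymbol{\theta})$ is as before.

One approach to setting the hyperparameter $\boldsymbol{\theta}$ is to determine its posterior $p(\boldsymbol{\theta} \mid \mathbf{y})$ and
obtain a point estimate, for instance via its maximum a posteriori estimate. However, this
can be a computationally demanding exercise. What is frequently done in practice is to
consider instead the marginal likelihood $p(\mathbf{y} \mid \boldsymbol{\theta})$ and maximize this with respect to $\boldsymbol{\theta}$. This
procedure is called empirical Bayes.

Considering again the mean function $m$ to be identically zero, from (6.31), we have
that $(\mathbf{Y} \mid \boldsymbol{\theta})$ is multivariate normal with mean $\mathbf{0}$ and covariance matrix $\mathbf{K}_y = \mathbf{K} + \sigma^2 \mathbf{I}_n$,
immediately giving an expression for the marginal log-likelihood:

$$\ln p(\mathbf{y} \mid \boldsymbol{\theta}) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln |\det(\mathbf{K}_y)| - \frac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}. \qquad (6.36)$$

We notice that only the second and third terms in (6.36) depend on $\boldsymbol{\theta}$. Considering a partial
derivative of (6.36) with respect to a single element $\theta$ of the hyperparameter vector $\boldsymbol{\theta}$ yields

$$\frac{\partial}{\partial \theta} \ln p(\mathbf{y} \mid \boldsymbol{\theta}) = -\frac{1}{2} \mathrm{tr}\left(\mathbf{K}_y^{-1} \left[\frac{\partial}{\partial \theta} \mathbf{K}_y\right]\right) + \frac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \left[\frac{\partial}{\partial \theta} \mathbf{K}_y\right] \mathbf{K}_y^{-1} \mathbf{y}, \qquad (6.37)$$

where $\left[\frac{\partial}{\partial \theta} \mathbf{K}_y\right]$ is the element-wise derivative of matrix $\mathbf{K}_y$ with respect to $\theta$. If these partial
derivatives can be computed for each hyperparameter $\theta$, gradient information could be used
when maximizing (6.36).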
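
A finite-difference check is a quick way to confirm that the gradient formula (6.37) matches the log-likelihood (6.36). The sketch below assumes scalar inputs, a Gaussian kernel $\exp(-\tfrac{1}{2}(x - x')^2 / b^2)$ with bandwidth $b$, and the hyperparameter vector $\boldsymbol{\theta} = (b, \sigma)$; the data are simulated and all function names are illustrative.

```python
import numpy as np

def kernel_and_grad(x, b):
    """Gaussian kernel matrix and its element-wise derivative w.r.t. bandwidth b."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / b**2)
    return K, K * d2 / b**3

def log_marginal(x, y, b, sigma):
    """Marginal log-likelihood (6.36)."""
    n = len(x)
    Ky = kernel_and_grad(x, b)[0] + sigma**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Ky)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * y @ np.linalg.solve(Ky, y)

def grad_log_marginal(x, y, b, sigma):
    """Partial derivatives (6.37) with respect to b and sigma."""
    n = len(x)
    K, dK_db = kernel_and_grad(x, b)
    Kinv = np.linalg.inv(K + sigma**2 * np.eye(n))
    a = Kinv @ y
    grads = []
    for dK in (dK_db, 2 * sigma * np.eye(n)):   # dKy/db and dKy/dsigma
        grads.append(-0.5 * np.trace(Kinv @ dK) + 0.5 * a @ dK @ a)
    return np.array(grads)

rng = np.random.default_rng(2)
x = rng.uniform(size=30)
y = 2 * np.sin(2 * np.pi * x) + 0.5 * rng.normal(size=30)
b, sigma, h = 0.2, 0.5, 1e-5
grad = grad_log_marginal(x, y, b, sigma)
fd = np.array([
    (log_marginal(x, y, b + h, sigma) - log_marginal(x, y, b - h, sigma)) / (2 * h),
    (log_marginal(x, y, b, sigma + h) - log_marginal(x, y, b, sigma - h)) / (2 * h),
])
print(grad, fd)
```

The analytic gradient could then be passed to a gradient-based optimizer; the explicit inverse is used here only for brevity.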




Example 6.17 (GP Regression (cont.)) Continuing Example 6.16, we plot in Figure 6.9
the marginal log-likelihood as a function of the noise level $\sigma$ and bandwidth parameter.

Figure 6.9: Contours of the marginal log-likelihood for the GP regression example. The
maximum is denoted by a cross.


The maximum is attained for a bandwidth parameter around 0.20 and $\sigma \approx 0.44$, which
is very close to the left panel of Figure 6.8 for the case where $\sigma$ was assumed to be known
(and equal to 0.5). We note here that the marginal log-likelihood is extremely flat, perhaps
owing to the small number of points. ■


6.8 Kernel PCA 


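In its basic form, kernel PCA (principal component analysis) can be thought of as PCA in
feature space. The main motivation for PCA introduced in Section 4.8 was as a dimensionality
reduction technique. There, the analysis rested on an SVD of the matrix $\widehat{\boldsymbol{\Sigma}} = \frac{1}{n} \mathbf{X}^\top \mathbf{X}$,
where the data in $\mathbf{X}$ was first centered via $x'_{i,j} = x_{i,j} - \bar{x}_j$ where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{i,j}$.

What we shall do is to first re-cast the problem in terms of the Gram matrix $\mathbf{K} = \mathbf{X}\mathbf{X}^\top =
[\langle x_i, x_j \rangle]$ (note the different order of $\mathbf{X}$ and $\mathbf{X}^\top$), and subsequently replace the inner product
$\langle x, x' \rangle$ with $\kappa(x, x')$ for a general reproducing kernel $\kappa$. To make the link, let us start with
an SVD of $\mathbf{X}^\top$:

$$\mathbf{X}^\top = \mathbf{U}\mathbf{D}\mathbf{V}^\top. \qquad (6.38)$$

The dimensions of $\mathbf{X}^\top$, $\mathbf{U}$, $\mathbf{D}$, and $\mathbf{V}$ are $d \times n$, $d \times d$, $d \times n$, and $n \times n$, respectively. Then an
SVD of $\mathbf{X}^\top \mathbf{X}$ is

$$\mathbf{X}^\top \mathbf{X} = (\mathbf{U}\mathbf{D}\mathbf{V}^\top)(\mathbf{U}\mathbf{D}\mathbf{V}^\top)^\top = \mathbf{U}(\mathbf{D}\mathbf{D}^\top)\mathbf{U}^\top$$

and an SVD of $\mathbf{K}$ is

$$\mathbf{K} = (\mathbf{U}\mathbf{D}\mathbf{V}^\top)^\top (\mathbf{U}\mathbf{D}\mathbf{V}^\top) = \mathbf{V}(\mathbf{D}^\top \mathbf{D})\mathbf{V}^\top.$$

Let $\lambda_1 \geq \cdots \geq \lambda_r > 0$ denote the non-zero eigenvalues of $\mathbf{X}^\top \mathbf{X}$ (or, equivalently, of $\mathbf{K}$) and
denote the corresponding $r \times r$ diagonal matrix by $\boldsymbol{\Lambda}$. Without loss of generality we can
assume that the eigenvector of $\mathbf{X}^\top \mathbf{X}$ corresponding to $\lambda_k$ is the $k$-th column of $\mathbf{U}$ and that
the $k$-th column of $\mathbf{V}$ is an eigenvector of $\mathbf{K}$. Similar to Section 4.8, let $\mathbf{U}_k$ and $\mathbf{V}_k$ contain
the first $k$ columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and let $\boldsymbol{\Lambda}_k$ be the corresponding $k \times k$ submatrix
of $\boldsymbol{\Lambda}$, $k = 1, \ldots, r$.

By the SVD (6.38), we have $\mathbf{X}^\top \mathbf{V}_k = \mathbf{U}\mathbf{D}\mathbf{V}^\top \mathbf{V}_k = \mathbf{U}_k \boldsymbol{\Lambda}_k^{1/2}$. Next, consider the projection
of a point $x$ onto the $k$-dimensional linear space spanned by the columns of $\mathbf{U}_k$, the first
$k$ principal components. We saw in Section 4.8 that this projection simply is the linear
mapping $x \mapsto \mathbf{U}_k^\top x$. Using the fact that $\mathbf{U}_k = \mathbf{X}^\top \mathbf{V}_k \boldsymbol{\Lambda}_k^{-1/2}$, we find that $x$ is projected to a
point $z$ given by

$$z = \boldsymbol{\Lambda}_k^{-1/2} \mathbf{V}_k^\top \mathbf{X} x = \boldsymbol{\Lambda}_k^{-1/2} \mathbf{V}_k^\top \boldsymbol{\kappa}_x,$$

where we have (suggestively) defined $\boldsymbol{\kappa}_x := [\langle x_1, x \rangle, \ldots, \langle x_n, x \rangle]^\top$. The important point
is that $z$ is completely determined by the vector of inner products $\boldsymbol{\kappa}_x$ and the $k$ principal
eigenvalues and (right) eigenvectors of the Gram matrix $\mathbf{K}$. Note that each component $z_m$
of $z$ is of the form

$$z_m = \sum_{i=1}^n \alpha_{m,i}\, \kappa(x_i, x), \quad m = 1, \ldots, k. \qquad (6.39)$$

The preceding discussion assumed centering of the columns of $\mathbf{X}$. Consider now an
uncentered data matrix $\widetilde{\mathbf{X}}$. Then the centered data can be written as $\mathbf{X} = \widetilde{\mathbf{X}} - \frac{1}{n} \mathbf{E}_n \widetilde{\mathbf{X}}$, where
$\mathbf{E}_n$ is the $n \times n$ matrix of ones. Consequently,

$$\mathbf{X}\mathbf{X}^\top = \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top - \frac{1}{n} \mathbf{E}_n \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top - \frac{1}{n} \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top \mathbf{E}_n + \frac{1}{n^2} \mathbf{E}_n \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top \mathbf{E}_n,$$

or, more compactly, $\mathbf{X}\mathbf{X}^\top = \mathbf{H}\, \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top \mathbf{H}$, where $\mathbf{H} = \mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, $\mathbf{I}_n$ is the $n \times n$ identity matrix,
and $\mathbf{1}_n$ is the $n \times 1$ vector of ones.

To generalize to the kernel setting, we replace $\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^\top$ by $\mathbf{K} = [\kappa(x_i, x_j),\ i, j = 1, \ldots, n]$
and set $\boldsymbol{\kappa}_x = [\kappa(x_1, x), \ldots, \kappa(x_n, x)]^\top$, so that $\boldsymbol{\Lambda}_k$ is the diagonal matrix of the $k$ largest
eigenvalues of $\mathbf{H}\mathbf{K}\mathbf{H}$ and $\mathbf{V}_k$ is the corresponding matrix of eigenvectors. Note that the “usual”
PCA is recovered when we use the linear kernel $\kappa(x, y) = x^\top y$. However, instead of having
only kernels that are explicitly inner products of feature vectors, we are now permitted to
implicitly use infinite feature maps (functions) by using kernels.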
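
The recipe above can be sketched in a few lines, together with a check that the linear kernel recovers ordinary PCA. The sketch is illustrative and makes some simplifying assumptions: the function name `kernel_pca` is hypothetical, only the training points are projected (using the columns of the centered Gram matrix as their kernel vectors), and the comparison is made up to the sign of each component.

```python
import numpy as np

def kernel_pca(K, k):
    """Top-k eigenpairs of HKH and the projections of the n training points."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                          # centered Gram matrix
    lam, V = np.linalg.eigh(Kc)             # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:k]         # indices of the k largest
    lam_k, V_k = lam[idx], V[:, idx]
    Z = Kc @ V_k / np.sqrt(lam_k)           # row i: Lambda_k^{-1/2} V_k^T Kc[:, i]
    return lam_k, V_k, Z

# Toy data (hypothetical) with distinct variances per coordinate
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
lam_k, V_k, Z = kernel_pca(X @ X.T, k=2)    # linear kernel

# Ordinary PCA on the centered data via the SVD
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(Z), np.abs(Xc @ Vt[:2].T)))
print(np.allclose(lam_k, S[:2] ** 2))
```

Replacing `X @ X.T` by any other Gram matrix, for example one built from a Gaussian kernel, gives the nonlinear version without changing the function.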


Example 6.18 (Kernel PCA) We simulated 200 points, $x_1, \ldots, x_{200}$, from the uniform
distribution on the set $B_1 \cup (B_4 \cap B_3^c)$, where $B_r := \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \leq r^2\}$ (disk with
radius $r$). We apply kernel PCA with Gaussian kernel $\kappa(x, x') = \exp(-\|x - x'\|^2)$ and
compute the functions $z_m(x)$, $m = 1, \ldots, 9$ in (6.39). Their density plots are shown in
Figure 6.10. The data points are superimposed in each plot. From this we see that the principal
components identify the radial structure present in the data. Finally, Figure 6.11 shows
the projections $[z_1(x_i), z_2(x_i)]^\top$, $i = 1, \ldots, 200$ of the original data points onto the first two
principal components. We see that the projected points can be separated by a straight line,
whereas this is not possible for the original data; see also Example 7.6 for a related problem. ■

Figure 6.10: First nine eigenfunctions using a Gaussian kernel for the two-dimensional
data set formed by the red and cyan points.

Figure 6.11: Projection of the data onto the first two principal components. Observe that
already the projections of the inner and outer points are well separated.

Further Reading

For a good overview of ridge regression and the lasso, we refer the reader to [36, 56].
For overviews of the theory of RKHS we refer to [3, 115, 126], and for in-depth background
on splines and their connection to RKHSs we refer to [123]. For further details on GP
regression we refer to [97] and for kernel PCA in particular we refer to [12, 92]. Finally,
many facts about kernels and their corresponding RKHSs can be found in [115].

Exercises 

1. Let G be an RKHS with reproducing kernel x. Show that x is a positive semidefinite 
function. 

2. Show that a reproducing kernel, if it exists, is unique. 

3. Let G be a Hilbert space of functions g : X — R. Recall that the evaluation func- 
tional is the map ôy : g =œ g(x) for a given x € X. Show that evaluation functionals 
are linear operators. 

4. Let Go be the pre-RKHS Gp constructed in the proof of Theorem 6.2. Thus, g € Go 
is of the form g = };;-1 Œi Ky, and 

(8, Kx)Go = DG Qi (Kx; Kx) Go = >, Qi; K(X;,X) = g(x). 
i=l i=l 
Therefore, we may write the evaluation functional of g € Go at x as xg := (8, Kx}Go- 
Show that 6, is bounded on Go for every x; that is, |ô f| < y|lfllg,, for some y < ov. 

5. Continuing Exercise 4, let (fp) be a Cauchy sequence in Go such that |f,(x)| — O for 
all x. Show that ||fnllg, > 9. 

6. Continuing Exercises 5 and 4, to show that the inner product (6.14) is well defined, 
a number of facts have to be checked. 

(a) Verify that the limit converges. 

(b) Verify that the limit is independent of the Cauchy sequences used. 

(c) Verify that the properties of an inner product are satisfied. The only non-trivial 
property to verify is that (f, f)g = 0 if and only if f = 0. 

7. Exercises 4—6 show that G defined in the proof of Theorem 6.2 is an inner product 
space. It remains to prove that G is an RKHS. This requires us to prove that the inner 
product space G is complete (and thus Hilbert), and that its evaluation functionals 
are bounded and hence continuous (see Theorem A.16). This is done in a number of IS 389 


steps. 


(a) Show that Go is dense in G in the sense that every f € G is a limit point (with 
respect to the norm on G) of a Cauchy sequence (fa) in Go. 
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10. 


11. 


12. 


HILBERT SPACE 
ISOMORPHISM 


13. 
14. 


(b) Show that every evaluation functional 6, on G is continuous at the 0 function. 
That is, 
Ve>0:46>0:VfP EG: |lfllg < = |f(x)| < e. (6.40) 


Continuity of 6, at all functions g € G then follows automatically from linearity. 


(c) Show that G is complete; that is, every Cauchy sequence (fa) E€ G converges in 
the norm ||- |Ig. 


. If kı and x are kernels on X and Y, then x,((x, y), (x’, y’)) := K(x, x’) + kKo(y, y’) 


and ky ((x, y), (x’, y’) := K(x, X’)ko(y, y’) are kernels on the Cartesian product X x Y. 
Prove this. 


. An RKHS enjoys the following desirable smoothness property: if (gn) is a sequence 


belonging to RKHS G on X, and ||g,,— gll — 0, then g(x) = lim, g,(x) for all x € X. 
Prove this, using Cauchy—Schwarz. 


Let X be an R“-valued random variable that is symmetric about the origin (that is, 
X and (—X) are identically distributed). Denote by yp is its distribution and W(t) = 
Ee" X = f etf * u(dx) for t € Rf is its characteristic function. Verify that x(x, x’) = 
y(x — x’) is a real-valued positive semidefinite function. 


Suppose an RKHS G of functions from X — R (with kernel x) is invariant under a 
group T of transformations T : X — X; that is, for all f, g € G and T € T, we have 
(i) fo T € Gand (ii) (f oT, go T)g = (f, g)g. Show that «(Tx, Tx’) = k(x, x’) for 
allx,x’eNXandTeT. 


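12. Given two Hilbert spaces $\mathcal{H}$ and $\mathcal{G}$, we call a mapping $A : \mathcal{H} \to \mathcal{G}$ a Hilbert space
isomorphism if it is

(i) a linear map; that is, $A(af + bg) = aA(f) + bA(g)$ for any $f, g \in \mathcal{H}$ and $a, b \in \mathbb{R}$;
(ii) a surjective map; and
(iii) an isometry; that is, for all $f, g \in \mathcal{H}$, it holds that $\langle f, g\rangle_{\mathcal{H}} = \langle Af, Ag\rangle_{\mathcal{G}}$.

Let $\mathcal{H} = \mathbb{R}^p$ (equipped with the usual Euclidean inner product) and construct its
(continuous) dual space $\mathcal{G}$, consisting of all continuous linear functions from $\mathbb{R}^p$ to
$\mathbb{R}$, as follows: (a) For each $\boldsymbol{\beta} \in \mathbb{R}^p$, define $g_{\boldsymbol{\beta}} : \mathbb{R}^p \to \mathbb{R}$ via
$g_{\boldsymbol{\beta}}(\mathbf{x}) = \langle\boldsymbol{\beta}, \mathbf{x}\rangle = \boldsymbol{\beta}^\top\mathbf{x}$, for all $\mathbf{x} \in \mathbb{R}^p$.
(b) Equip $\mathcal{G}$ with the inner product $\langle g_{\boldsymbol{\beta}}, g_{\boldsymbol{\gamma}}\rangle_{\mathcal{G}} := \boldsymbol{\beta}^\top\boldsymbol{\gamma}$.

Show that $A : \mathcal{H} \to \mathcal{G}$ defined by $A(\boldsymbol{\beta}) = g_{\boldsymbol{\beta}}$ for $\boldsymbol{\beta} \in \mathbb{R}^p$ is a Hilbert space isomorphism.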


13. Let $\mathbf{X}$ be an $n \times p$ model matrix. Show that $\mathbf{X}^\top\mathbf{X} + n\gamma\,\mathbf{I}_p$ for $\gamma > 0$ is invertible.


14. As Example 6.8 clearly illustrates, the pdf of a random variable that is symmetric
about the origin is not in general a valid reproducing kernel. Take two such iid random
variables $X$ and $X'$ with common pdf $f$, and define $Z = X + X'$. Denote by $\psi_Z$
and $f_Z$ the characteristic function and pdf of $Z$, respectively.

Show that if $\psi_Z$ is in $L^1(\mathbb{R})$, $f_Z$ is a positive semidefinite function. Use this to show
that $\kappa(x,x') = f_Z(x - x') = \mathbb{1}\{|x - x'| \le 2\}\,(1 - |x - x'|/2)$ is a valid reproducing kernel.


15. For the smoothing cubic spline of Section 6.6, show that

$$\kappa(x,u) = \frac{\max\{x,u\}\,\min\{x,u\}^2}{2} - \frac{\min\{x,u\}^3}{6}.$$

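16. Let $\mathbf{X}$ be an $n \times p$ model matrix and let $\mathbf{u} \in \mathbb{R}^p$ be the unit-length vector with $k$-th
entry equal to one ($u_k = \|\mathbf{u}\| = 1$). Suppose that the $k$-th column of $\mathbf{X}$ is $\mathbf{v}$ and that it
is replaced with a new predictor $\mathbf{w}$, so that we obtain the new model matrix:

$$\widetilde{\mathbf{X}} = \mathbf{X} + (\mathbf{w} - \mathbf{v})\,\mathbf{u}^\top.$$

(a) Denoting

$$\boldsymbol{\delta} := \mathbf{X}^\top(\mathbf{w} - \mathbf{v}) + \frac{\|\mathbf{w} - \mathbf{v}\|^2}{2}\,\mathbf{u},$$

show that

$$\widetilde{\mathbf{X}}^\top\widetilde{\mathbf{X}} = \mathbf{X}^\top\mathbf{X} + \mathbf{u}\boldsymbol{\delta}^\top + \boldsymbol{\delta}\mathbf{u}^\top = \mathbf{X}^\top\mathbf{X} + \frac{(\mathbf{u}+\boldsymbol{\delta})(\mathbf{u}+\boldsymbol{\delta})^\top}{2} - \frac{(\mathbf{u}-\boldsymbol{\delta})(\mathbf{u}-\boldsymbol{\delta})^\top}{2}.$$

In other words, $\widetilde{\mathbf{X}}^\top\widetilde{\mathbf{X}}$ differs from $\mathbf{X}^\top\mathbf{X}$ by a symmetric matrix of rank two.

(b) Suppose that $\mathbf{B} := (\mathbf{X}^\top\mathbf{X} + n\gamma\,\mathbf{I}_p)^{-1}$ is already computed. Explain how the
Sherman–Morrison formulas in Theorem A.10 can be applied twice to compute
the inverse and log-determinant of the matrix $\widetilde{\mathbf{X}}^\top\widetilde{\mathbf{X}} + n\gamma\,\mathbf{I}_p$ in $O((n + p)p)$
computing time, rather than the usual $O((n + p^2)p)$ computing time.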

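(c) Write a Python program for updating a matrix $\mathbf{B} = (\mathbf{X}^\top\mathbf{X} + n\gamma\,\mathbf{I}_p)^{-1}$ when we
change the $k$-th column of $\mathbf{X}$, as shown in the following pseudo-code.


Algorithm 6.8.1: Updating via Sherman–Morrison Formula
input: Matrices $\mathbf{X}$ and $\mathbf{B}$, index $k$, and replacement $\mathbf{w}$ for the $k$-th column of $\mathbf{X}$.
output: Updated matrices $\mathbf{X}$ and $\mathbf{B}$.

1 Set $\mathbf{v} \in \mathbb{R}^n$ to be the $k$-th column of $\mathbf{X}$.
2 Set $\mathbf{u} \in \mathbb{R}^p$ to be the unit-length vector such that $u_k = \|\mathbf{u}\| = 1$.
3 $\mathbf{B} \leftarrow \mathbf{B} - \dfrac{\mathbf{B}\mathbf{u}\boldsymbol{\delta}^\top\mathbf{B}}{1 + \boldsymbol{\delta}^\top\mathbf{B}\mathbf{u}}$
4 $\mathbf{B} \leftarrow \mathbf{B} - \dfrac{\mathbf{B}\boldsymbol{\delta}\mathbf{u}^\top\mathbf{B}}{1 + \mathbf{u}^\top\mathbf{B}\boldsymbol{\delta}}$
5 Update the $k$-th column of $\mathbf{X}$ with $\mathbf{w}$.
6 return $\mathbf{X}, \mathbf{B}$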


³This Sherman–Morrison updating is not always numerically stable. A more numerically stable method
will perform two consecutive rank-one updates of the Cholesky decomposition of $\mathbf{X}^\top\mathbf{X} + n\gamma\,\mathbf{I}_p$.


17. Use Algorithm 6.8.1 from Exercise 16 to write Python code that computes the ridge
regression coefficient $\widehat{\boldsymbol{\beta}}$ in (6.5) and use it to replicate the results in Figure 6.1. The
following pseudo-code (with running cost of $O((n + p)p^2)$) may help with the writing
of the Python code.


Algorithm 6.8.2: Ridge Regression Coefficients via Sherman–Morrison Formula
input: Training set $\{\mathbf{X}, \mathbf{y}\}$ and regularization parameter $\gamma > 0$.
output: Solution $\widehat{\boldsymbol{\beta}} = (n\gamma\,\mathbf{I}_p + \mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.

1 Set $\mathbf{A}$ to be an $n \times p$ matrix of zeros and $\mathbf{B} \leftarrow (n\gamma\,\mathbf{I}_p)^{-1}$.
2 for $j = 1, \ldots, p$ do
3     Set $\mathbf{w}$ to be the $j$-th column of $\mathbf{X}$.
4     Update $\{\mathbf{A}, \mathbf{B}\}$ via Algorithm 6.8.1 with inputs $\{\mathbf{A}, \mathbf{B}, j, \mathbf{w}\}$.
5 $\widehat{\boldsymbol{\beta}} \leftarrow \mathbf{B}(\mathbf{X}^\top\mathbf{y})$
6 return $\widehat{\boldsymbol{\beta}}$


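18. Consider Example 2.10 with $\mathbf{D} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ for some nonnegative vector $\boldsymbol{\lambda} \in \mathbb{R}^p$,
so that twice the negative logarithm of the model evidence can be written as

$$-2\ln g(\mathbf{y}) = l(\boldsymbol{\lambda}) := n\ln[\mathbf{y}^\top(\mathbf{I} - \mathbf{X}\boldsymbol{\Sigma}\mathbf{X}^\top)\mathbf{y}] + \ln|\mathbf{D}| - \ln|\boldsymbol{\Sigma}| + c,$$

where $c$ is a constant that depends only on $n$.

(a) Use the Woodbury identities (A.15) and (A.16) to show that

$$\mathbf{I} - \mathbf{X}\boldsymbol{\Sigma}\mathbf{X}^\top = (\mathbf{I} + \mathbf{X}\mathbf{D}\mathbf{X}^\top)^{-1}$$
$$\ln|\mathbf{D}| - \ln|\boldsymbol{\Sigma}| = \ln|\mathbf{I} + \mathbf{X}\mathbf{D}\mathbf{X}^\top|.$$

Deduce that $l(\boldsymbol{\lambda}) = n\ln[\mathbf{y}^\top\mathbf{C}\mathbf{y}] - \ln|\mathbf{C}| + c$, where $\mathbf{C} := (\mathbf{I} + \mathbf{X}\mathbf{D}\mathbf{X}^\top)^{-1}$.

(b) Let $[\mathbf{v}_1, \ldots, \mathbf{v}_p] := \mathbf{X}$ denote the $p$ columns/predictors of $\mathbf{X}$. Show that

$$\mathbf{C}^{-1} = \mathbf{I} + \sum_{k=1}^{p} \lambda_k\,\mathbf{v}_k\mathbf{v}_k^\top.$$

Explain why setting $\lambda_k = 0$ has the effect of excluding the $k$-th predictor from
the regression model. How can this observation be used for model selection?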


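(c) Prove the following formulas for the gradient and Hessian elements of $l(\boldsymbol{\lambda})$:

$$\frac{\partial l}{\partial\lambda_i} = \mathbf{v}_i^\top\mathbf{C}\mathbf{v}_i - n\,\frac{(\mathbf{v}_i^\top\mathbf{C}\mathbf{y})^2}{\mathbf{y}^\top\mathbf{C}\mathbf{y}},$$
$$\frac{\partial^2 l}{\partial\lambda_i\,\partial\lambda_j} = (n-1)\,(\mathbf{v}_i^\top\mathbf{C}\mathbf{v}_j)^2 - n\left[\mathbf{v}_i^\top\mathbf{C}\mathbf{v}_j - \frac{(\mathbf{v}_i^\top\mathbf{C}\mathbf{y})(\mathbf{v}_j^\top\mathbf{C}\mathbf{y})}{\mathbf{y}^\top\mathbf{C}\mathbf{y}}\right]^2. \qquad (6.41)$$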


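(d) One method to determine which predictors in $\mathbf{X}$ are important is to compute

$$\boldsymbol{\lambda}^* := \operatorname*{argmin}_{\boldsymbol{\lambda} \ge \mathbf{0}}\; l(\boldsymbol{\lambda})$$

using, for example, the interior-point minimization Algorithm B.4.1 with gradient
and Hessian computed from (6.41). Write Python code to compute $\boldsymbol{\lambda}^*$ and
use it to select the best polynomial model in Example 2.10.


19. (Exercise 18 continued.) Consider again Example 2.10 with $\mathbf{D} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ for
some nonnegative model-selection parameter $\boldsymbol{\lambda} \in \mathbb{R}^p$. A Bayesian choice for $\boldsymbol{\lambda}$ is the
maximizer of the marginal likelihood $g(\mathbf{y}\,|\,\boldsymbol{\lambda})$; that is,

$$\boldsymbol{\lambda}^* = \operatorname*{argmax}_{\boldsymbol{\lambda} \ge \mathbf{0}} \iint g(\boldsymbol{\beta}, \sigma^2, \mathbf{y}\,|\,\boldsymbol{\lambda})\,\mathrm{d}\boldsymbol{\beta}\,\mathrm{d}\sigma^2,$$

where

$$\ln g(\boldsymbol{\beta}, \sigma^2, \mathbf{y}\,|\,\boldsymbol{\lambda}) = -\frac{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \boldsymbol{\beta}^\top\mathbf{D}^{-1}\boldsymbol{\beta}}{2\sigma^2} - \frac{1}{2}\ln|\mathbf{D}| - \frac{n+p+2}{2}\ln\sigma^2 - \frac{n+p}{2}\ln(2\pi).$$

To maximize $g(\mathbf{y}\,|\,\boldsymbol{\lambda})$, one can use the EM algorithm with $\boldsymbol{\beta}$ and $\sigma^2$ acting as latent
variables in the complete-data log-likelihood $\ln g(\boldsymbol{\beta}, \sigma^2, \mathbf{y}\,|\,\boldsymbol{\lambda})$. Define

$$\boldsymbol{\Sigma} := (\mathbf{D}^{-1} + \mathbf{X}^\top\mathbf{X})^{-1}$$
$$\overline{\boldsymbol{\beta}} := \boldsymbol{\Sigma}\mathbf{X}^\top\mathbf{y} \qquad (6.42)$$
$$\widehat{\sigma}^2 := (\|\mathbf{y}\|^2 - \mathbf{y}^\top\mathbf{X}\overline{\boldsymbol{\beta}})/n.$$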


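(a) Show that the conditional density of the latent variables $\boldsymbol{\beta}$ and $\sigma^2$ is such that

$$(\sigma^{-2}\,|\,\boldsymbol{\lambda}, \mathbf{y}) \sim \mathrm{Gamma}\!\left(\frac{n}{2}, \frac{n\,\widehat{\sigma}^2}{2}\right)$$
$$(\boldsymbol{\beta}\,|\,\boldsymbol{\lambda}, \sigma^2, \mathbf{y}) \sim \mathcal{N}(\overline{\boldsymbol{\beta}}, \sigma^2\boldsymbol{\Sigma}).$$

(b) Use Theorem C.2 to show that the expected complete-data log-likelihood is

$$-\frac{\overline{\boldsymbol{\beta}}^\top\mathbf{D}^{-1}\overline{\boldsymbol{\beta}}}{2\,\widehat{\sigma}^2} - \frac{\mathrm{tr}(\mathbf{D}^{-1}\boldsymbol{\Sigma}) + \ln|\mathbf{D}|}{2} + c_1,$$

where $c_1$ is a constant that does not depend on $\boldsymbol{\lambda}$.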


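(c) Use Theorem A.2 to simplify the expected complete-data log-likelihood and to
show that it is maximized at $\lambda_i = \Sigma_{ii} + (\overline{\beta}_i/\widehat{\sigma})^2$ for $i = 1, \ldots, p$. Hence, deduce
the following E and M steps in the EM algorithm:

E-step. Given $\boldsymbol{\lambda}$, update $(\boldsymbol{\Sigma}, \overline{\boldsymbol{\beta}}, \widehat{\sigma}^2)$ via the formulas (6.42).
M-step. Given $(\boldsymbol{\Sigma}, \overline{\boldsymbol{\beta}}, \widehat{\sigma}^2)$, update $\boldsymbol{\lambda}$ via $\lambda_i = \Sigma_{ii} + (\overline{\beta}_i/\widehat{\sigma})^2$, $i = 1, \ldots, p$.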


(d) Write Python code to compute $\boldsymbol{\lambda}^*$ via the EM algorithm, and use it to select
the best polynomial model in Example 2.10. A possible stopping criterion is to
terminate the EM iterations when

$$\ln g(\mathbf{y}\,|\,\boldsymbol{\lambda}_{t+1}) - \ln g(\mathbf{y}\,|\,\boldsymbol{\lambda}_t) < \varepsilon$$

for some small $\varepsilon > 0$, where the marginal log-likelihood is

$$\ln g(\mathbf{y}\,|\,\boldsymbol{\lambda}) = -\frac{n}{2}\ln(n\pi\,\widehat{\sigma}^2) - \frac{1}{2}\ln|\mathbf{D}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \ln\Gamma(n/2).$$


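20. In this exercise we explore how the early stopping of the gradient descent iterations
(see Example B.10),

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha\nabla f(\mathbf{x}_t), \quad t = 0, 1, \ldots,$$

is (approximately) equivalent to the global minimization of $f(\mathbf{x}) + \frac{1}{2}\gamma\|\mathbf{x}\|^2$ for certain
values of the ridge regularization parameter $\gamma > 0$ (see Example 6.1). We illustrate
the early stopping idea on the quadratic function $f(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\mathbf{H}(\mathbf{x} - \boldsymbol{\mu})$, where
$\mathbf{H} \in \mathbb{R}^{n \times n}$ is a symmetric positive-definite (Hessian) matrix with eigenvalues $\{\lambda_k\}_{k=1}^n$.

(a) Verify that for a symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ such that $\mathbf{I} - \mathbf{A}$ is invertible, we have

$$\mathbf{I} + \mathbf{A} + \cdots + \mathbf{A}^{t-1} = (\mathbf{I} - \mathbf{A}^t)(\mathbf{I} - \mathbf{A})^{-1}.$$

(b) Let $\mathbf{H} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$ be the diagonalization of $\mathbf{H}$ as per Theorem A.8. If $\mathbf{x}_0 = \mathbf{0}$,
show that the formula for $\mathbf{x}_t$ is

$$\mathbf{x}_t = \boldsymbol{\mu} - \mathbf{Q}(\mathbf{I} - \alpha\boldsymbol{\Lambda})^t\mathbf{Q}^\top\boldsymbol{\mu}.$$

Hence, deduce that a necessary condition for $\mathbf{x}_t$ to converge is $\alpha < 2/\max_k \lambda_k$.

(c) Show that the minimizer of $f(\mathbf{x}) + \frac{1}{2}\gamma\|\mathbf{x}\|^2$ can be written as

$$\mathbf{x}^* = \boldsymbol{\mu} - \mathbf{Q}(\mathbf{I} + \gamma^{-1}\boldsymbol{\Lambda})^{-1}\mathbf{Q}^\top\boldsymbol{\mu}.$$

(d) For a fixed value of $t$, let the learning rate $\alpha \downarrow 0$. Using parts (b) and (c), show
that if $\gamma \simeq 1/(t\alpha)$ as $\alpha \downarrow 0$, then $\mathbf{x}_t \simeq \mathbf{x}^*$. In other words, $\mathbf{x}_t$ is approximately
equal to $\mathbf{x}^*$ for small $\alpha$, provided that $\gamma$ is inversely proportional to $t\alpha$.


The purpose of this chapter is to explain the mathematical ideas behind well-known
classification techniques such as the naive Bayes method, linear and quadratic discriminant
analysis, logistic/softmax classification, the K-nearest neighbors method, and
support vector machines.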


7.1 Introduction 


Classification methods are supervised learning methods in which a categorical response
variable $Y$ takes one of $c$ possible values (for example, whether a person is sick or healthy),
which is to be predicted from a vector $\mathbf{X}$ of explanatory variables (for example, the blood
pressure, age, and smoking status of the person), using a prediction function $g$. In this
sense, $g$ classifies the input $\mathbf{X}$ into one of the classes, say in the set $\{0, \ldots, c-1\}$. For this
reason, we will call $g$ a classification function or simply classifier. As with any supervised
learning technique (see Section 2.3), the goal is to minimize the expected loss or risk

$$\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(\mathbf{X})) \qquad (7.1)$$

for some loss function, $\mathrm{Loss}(y, \widehat{y})$, that quantifies the impact of classifying a response $y$ via
$\widehat{y} = g(\mathbf{x})$. The natural loss function is the zero–one (also written 0-1) or indicator loss:
$\mathrm{Loss}(y, \widehat{y}) := \mathbb{1}\{y \ne \widehat{y}\}$; that is, there is no loss for a correct classification ($y = \widehat{y}$) and a
unit loss for a misclassification ($y \ne \widehat{y}$). In this case the optimal classifier $g^*$ is given in the
following theorem.


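Theorem 7.1: Optimal classifier

For the loss function $\mathrm{Loss}(y, \widehat{y}) = \mathbb{1}\{y \ne \widehat{y}\}$, an optimal classification function $g^*$ is equal to

$$g^*(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0, \ldots, c-1\}} \mathbb{P}[Y = y \,|\, \mathbf{X} = \mathbf{x}]. \qquad (7.2)$$


Proof: The goal is to minimize $\ell(g) = \mathbb{E}\,\mathbb{1}\{Y \ne g(\mathbf{X})\}$ over all functions $g$ taking values in
$\{0, \ldots, c-1\}$. Conditioning on $\mathbf{X}$ gives, by the tower property, $\ell(g) = \mathbb{E}\,(\mathbb{P}[Y \ne g(\mathbf{X}) \,|\, \mathbf{X}])$,
and so minimizing $\ell(g)$ with respect to $g$ can be accomplished by maximizing
$\mathbb{P}[Y = g(\mathbf{x}) \,|\, \mathbf{X} = \mathbf{x}]$ with respect to $g(\mathbf{x})$, for every fixed $\mathbf{x}$. In other words, take $g(\mathbf{x})$ to be equal
to the class label $y$ for which $\mathbb{P}[Y = y \,|\, \mathbf{X} = \mathbf{x}]$ is maximal. □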


The formulation (7.2) allows for “ties”, when there is an equal probability between
optimal classes for a feature vector $\mathbf{x}$. Assigning one of these tied classes arbitrarily (or
randomly) to $\mathbf{x}$ does not affect the loss function and so we assume for simplicity that $g^*(\mathbf{x})$
is always a scalar value.

Note that, as was the case for the regression (see, e.g., Theorem 2.1), the optimal prediction
function depends on the conditional pdf $f(y\,|\,\mathbf{x}) = \mathbb{P}[Y = y \,|\, \mathbf{X} = \mathbf{x}]$. However, since
we assign $\mathbf{x}$ to class $y$ if $f(y\,|\,\mathbf{x}) \ge f(z\,|\,\mathbf{x})$ for all $z$, we do not need to learn the entire surface
of the function $f(y\,|\,\mathbf{x})$; we only need to estimate it well enough near the decision
boundary $\{\mathbf{x} : f(y\,|\,\mathbf{x}) = f(z\,|\,\mathbf{x})\}$ for any choice of classes $y$ and $z$. This is because the
assignment (7.2) divides the feature space into $c$ regions, $\mathcal{R}_y = \{\mathbf{x} : f(y\,|\,\mathbf{x}) = \max_z f(z\,|\,\mathbf{x})\}$,
$y = 0, \ldots, c-1$.

Recall that for any supervised learning problem the smallest possible expected loss
(that is, the irreducible risk) is given by $\ell^* = \ell(g^*)$. For the indicator loss, the irreducible
risk is equal to $\mathbb{P}[Y \ne g^*(\mathbf{X})]$. This smallest possible probability of misclassification is
often called the Bayes error rate.
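As a small numerical illustration of Theorem 7.1 and the Bayes error rate, the following sketch computes the optimal classifier and its misclassification probability for a discrete feature. The joint pmf `joint` and the function names are made up for this illustration; they do not come from the text.

```python
import numpy as np

# Hypothetical joint pmf P[X = x, Y = y]: rows are x in {0, 1, 2},
# columns are classes y in {0, 1}. The numbers are illustrative only.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.15],
                  [0.05, 0.25]])

def bayes_classifier(joint):
    """Return g*(x) = argmax_y P[Y = y | X = x] for every x.

    Dividing a row by P[X = x] does not change its argmax, so the joint
    pmf can be used directly. Ties are broken in favor of the smallest label.
    """
    return np.argmax(joint, axis=1)

def bayes_error(joint):
    """Return P[Y != g*(X)] = 1 - sum_x max_y P[X = x, Y = y]."""
    return 1.0 - joint.max(axis=1).sum()

g_star = bayes_classifier(joint)
err = bayes_error(joint)
print(g_star, err)
```

For x = 1 the two classes are tied, so either label gives the same risk, as noted after the theorem.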


For a given training set $\tau$, a classifier is often derived from a pre-classifier $g_\tau$, which
is a prediction function (learner) that can take any real value, rather than only values
in the set of class labels. A typical situation is the case of binary classification with
labels $-1$ and $1$, where the prediction function $g_\tau$ is a function taking values in the
interval $[-1, 1]$ and the actual classifier is given by $\mathrm{sign}(g_\tau)$. It will be clear from
the context whether a prediction function $g_\tau$ should be interpreted as a classifier or
pre-classifier.





The indicator loss function may not always be the most appropriate choice of loss 
function for a given classification problem. For example, when diagnosing an illness, the 
mistake in misclassifying a person as being sick when in fact the person is healthy may 
be less serious than classifying the person as healthy when in fact the person is sick. In 
Section 7.2 we consider various classification metrics. 

There are many ways to fit a classifier to a training set $\tau = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$. The
approach taken in Section 7.3 is to use a Bayesian framework for classification. Here the
conditional pdf $f(y\,|\,\mathbf{x})$ is viewed as a posterior pdf $f(y\,|\,\mathbf{x}) \propto f(\mathbf{x}\,|\,y)\,f(y)$ for a given class
prior $f(y)$ and likelihood $f(\mathbf{x}\,|\,y)$. Section 7.4 discusses linear and quadratic discriminant
analysis for classification, which assumes that the class of approximating functions for the
conditional pdf $f(\mathbf{x}\,|\,y)$ is a parametric class $\mathcal{G}$ of Gaussian densities. As a result of this
choice of $\mathcal{G}$, the marginal $f(\mathbf{x})$ is approximated via a Gaussian mixture density.

In contrast, in the logistic or soft-max classification in Section 7.5, the conditional 
pdf f(y|x) is approximated using a more flexible class of approximating functions. As a 
result of this, the approximation to the marginal density f(x) does not belong to a simple 
parametric class (such as a Gaussian mixture). As in unsupervised learning, the cross- 
entropy loss is the most common choice for training the learner. 

The K-nearest neighbors method, discussed in Section 7.6, is yet another approach to
classification that makes minimal assumptions on the class $\mathcal{G}$. Here the aim is to directly
estimate the conditional pdf $f(y\,|\,\mathbf{x})$ from the training data, using only feature vectors in
the neighborhood of $\mathbf{x}$. In Section 7.7 we explain the support vector methodology for
classification; this is based on the same Reproducing Kernel Hilbert Space ideas that proved
successful for regression analysis in Section 6.3. Finally, a versatile way to do both
classification and regression is to use classification and regression trees. This is the topic of
Chapter 8. Neural networks (Chapter 9) provide yet another way to perform classification.


7.2 Classification Metrics 


The effectiveness of a classifier $g$ is, theoretically, measured in terms of the risk (7.1), which
depends on the loss function used. Fitting a classifier to iid training data $\tau = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is
carried out by minimizing the training loss

$$\ell_\tau(g) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(y_i, g(\mathbf{x}_i)) \qquad (7.3)$$

over some class of functions $\mathcal{G}$. As the training loss is often a poor estimator of the risk,
the risk is usually estimated as in (7.3), using instead a test set $\tau' = \{(\mathbf{x}'_i, y'_i)\}_{i=1}^{n'}$ that is
independent of the training set, as explained in Section 2.3. To measure the performance
of a classifier on a training or test set, it is convenient to introduce the notion of a loss
matrix. Consider a classification problem with classifier $g$, loss function Loss, and classes
$0, \ldots, c-1$. If an input feature vector $\mathbf{x}$ is classified as $\widehat{y} = g(\mathbf{x})$ when the observed class
is $y$, the loss incurred is, by definition, $\mathrm{Loss}(y, \widehat{y})$. Consequently, we may identify the loss
function with a matrix $\mathbf{L} = [\mathrm{Loss}(j, k),\ j, k \in \{0, \ldots, c-1\}]$. For the indicator loss function,
the matrix $\mathbf{L}$ has 0s on the diagonal and 1s everywhere else. Another useful matrix is the
confusion matrix, denoted by $\mathbf{M}$, where the $(j, k)$-th element of $\mathbf{M}$ counts the number of
times that, for the training or test data, the actual (observed) class is $j$ whereas the predicted
class is $k$. Table 7.1 shows the confusion matrix of some Dog/Cat/Possum classifier.


Table 7.1: Confusion matrix for three classes.

                   Predicted
Actual       Dog     Cat     Possum
Dog           30       2          6
Cat            8      22         15
Possum         7       4         41


We can now express the classifier performance (7.3) in terms of $\mathbf{L}$ and $\mathbf{M}$ as

$$\frac{1}{n}\sum_{j,k} [\mathbf{L} \odot \mathbf{M}]_{jk}, \qquad (7.4)$$

where $\mathbf{L} \odot \mathbf{M}$ is the elementwise product of $\mathbf{L}$ and $\mathbf{M}$. Note that for the indicator loss, (7.4)
is simply $1 - \mathrm{tr}(\mathbf{M})/n$, and is called the misclassification error. The expression (7.4) makes
it clear that both the counts and the loss are important in determining the performance of a
classifier.

In the spirit of Table C.4 for hypothesis testing, it is sometimes useful to divide the
elements of a confusion matrix into four groups. The diagonal elements are the true positive
counts; that is, the numbers of correct classifications for each class. The true positive counts
for the Dog, Cat, and Possum classes in Table 7.1 are 30, 22, and 41, respectively. Similarly,
the true negative count for a class is the sum of all matrix elements that do not belong to the
row or the column of this particular class. For the Dog class it is 22 + 15 + 4 + 41 = 82. The
false positive count for a class is the sum of the corresponding column elements without
the diagonal element. For the Dog class it is 8 + 7 = 15. Finally, the false negative count
for a specific class can be calculated by summing over the corresponding row elements
(again, without counting the diagonal element). For the Dog class it is 2 + 6 = 8.

In terms of the elements of the confusion matrix, we have the following counts for class
$j = 0, \ldots, c-1$:

True positive     $\mathrm{tp}_j = M_{jj}$,
False positive    $\mathrm{fp}_j = \sum_{k \ne j} M_{kj}$,   (column sum)
False negative    $\mathrm{fn}_j = \sum_{k \ne j} M_{jk}$,   (row sum)
True negative     $\mathrm{tn}_j = n - \mathrm{fn}_j - \mathrm{fp}_j - \mathrm{tp}_j$.


Note that in the binary classification case ($c = 2$), and using the indicator loss function,
the misclassification error (7.4) can be written as

$$\mathrm{error}_j = \frac{\mathrm{fp}_j + \mathrm{fn}_j}{n}. \qquad (7.5)$$

This does not depend on which of the two classes is considered, as $\mathrm{fp}_0 + \mathrm{fn}_0 = \mathrm{fp}_1 + \mathrm{fn}_1$.
Similarly, the accuracy measures the fraction of correctly classified objects:

$$\mathrm{accuracy}_j = 1 - \mathrm{error}_j = \frac{\mathrm{tp}_j + \mathrm{tn}_j}{n}. \qquad (7.6)$$
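The per-class counts can be read off a confusion matrix with a few array operations. The following sketch does this for Table 7.1 and reproduces the Dog-class numbers quoted in the text; the helper `class_counts` is a name introduced here for illustration.

```python
import numpy as np

# Confusion matrix of Table 7.1 (rows: actual, columns: predicted),
# in the order Dog, Cat, Possum.
M = np.array([[30,  2,  6],
              [ 8, 22, 15],
              [ 7,  4, 41]])

def class_counts(M):
    """Return per-class (tp, fp, fn, tn) vectors for a confusion matrix M."""
    M = np.asarray(M)
    n = M.sum()
    tp = np.diag(M)
    fp = M.sum(axis=0) - tp   # column sums without the diagonal element
    fn = M.sum(axis=1) - tp   # row sums without the diagonal element
    tn = n - tp - fp - fn
    return tp, fp, fn, tn

tp, fp, fn, tn = class_counts(M)
for name, *counts in zip(["Dog", "Cat", "Possum"], tp, fp, fn, tn):
    print(name, counts)
```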


In some cases, classification error (or accuracy) alone is not sufficient to adequately 
describe the effectiveness of a classifier. As an example, consider the following two classi- 
fication problems based on a fingerprint detection system: 


1. Identification of authorized personnel in a top-secret military facility. 


2. Identification to get an online discount for some retail chain. 


Both problems are binary classification problems. However, a false positive in the first 
problem is extremely dangerous, while a false positive in the second problem will make 
a customer happy. Let us examine a classifier in the top-secret facility. The corresponding 
confusion matrix is given in Table 7.2. 


Table 7.2: Confusion matrix for authorized personnel classification.

                           Predicted
Actual            authorized    non-authorized
authorized               100               400
non-authorized            50           100,000


From (7.6), we conclude that the accuracy of classification is equal to

$$\mathrm{accuracy} = \frac{\mathrm{tp} + \mathrm{tn}}{\mathrm{tp} + \mathrm{tn} + \mathrm{fp} + \mathrm{fn}} = \frac{100 + 100{,}000}{100 + 100{,}000 + 50 + 400} \approx 99.55\%.$$


However, we can see that in this particular case, accuracy is a problematic metric, since
the algorithm allowed 50 non-authorized personnel to enter the facility. One way to deal
with this issue is to modify the loss function to give a much higher loss to non-authorized
access. Thus, instead of the indicator loss matrix $\mathbf{L}$ (0s on the diagonal and 1s elsewhere),
we could for example take the loss matrix


0 1 
oi fe i 
An alternative approach is to keep the indicator loss function and consider additional clas- 


sification metrics. Below we give a list of commonly used metrics. For simplicity we call 
an object whose actual class is j a “j-object”. 


• The precision (also called positive predictive value) is the fraction of all objects
classified as $j$ that are actually $j$-objects. Specifically,

$$\mathrm{precision}_j = \frac{\mathrm{tp}_j}{\mathrm{tp}_j + \mathrm{fp}_j}.$$

• The recall (also called sensitivity) is the fraction of all $j$-objects that are correctly
classified as such. That is,

$$\mathrm{recall}_j = \frac{\mathrm{tp}_j}{\mathrm{tp}_j + \mathrm{fn}_j}.$$

• The specificity measures the fraction of all non-$j$-objects that are correctly classified
as such. Specifically,

$$\mathrm{specificity}_j = \frac{\mathrm{tn}_j}{\mathrm{fp}_j + \mathrm{tn}_j}.$$

• The $F_\beta$ score is a combination of the precision and the recall and is used as a single
measurement for a classifier’s performance. The $F_\beta$ score is given by

$$F_{\beta,j} = \frac{(\beta^2 + 1)\,\mathrm{tp}_j}{(\beta^2 + 1)\,\mathrm{tp}_j + \beta^2\,\mathrm{fn}_j + \mathrm{fp}_j}.$$

For $\beta = 0$ we obtain the precision and for $\beta \to \infty$ we obtain the recall.


The particular choice of metric is clearly application dependent. For example, in the
classification of authorized personnel in a top-secret military facility, suppose we have
two classifiers. The first (Classifier 1) has a confusion matrix given in Table 7.2, and the
second (Classifier 2) has a confusion matrix given in Table 7.3. Various metrics for these
two classifiers are shown in Table 7.4. In this case we prefer Classifier 1, which has a much
higher precision.

Table 7.3: Confusion matrix for authorized personnel classification, using a different
classifier (Classifier 2).

                           Predicted
Actual            authorized    non-authorized
authorized                50                10
non-authorized           450           100,040




Table 7.4: Comparing the metrics for the confusion matrices in Tables 7.2 and 7.3.

Metric         Classifier 1         Classifier 2
accuracy       9.955 × 10^{-1}      9.954 × 10^{-1}
precision      6.667 × 10^{-1}      1.000 × 10^{-1}
recall         2.000 × 10^{-1}      8.333 × 10^{-1}
specificity    9.995 × 10^{-1}      9.955 × 10^{-1}
F_1            3.077 × 10^{-1}      1.786 × 10^{-1}
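The entries of Table 7.4 can be recomputed directly from the counts in Tables 7.2 and 7.3. The sketch below does so, treating "authorized" as the positive class (an assumption that is consistent with the tabulated values); the function `binary_metrics` and its argument names are introduced only for this illustration.

```python
def binary_metrics(tp, fn, fp, tn, beta=1.0):
    """Return accuracy, precision, recall, specificity and F_beta for one class."""
    n = tp + fn + fp + tn
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "F_beta": (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp),
    }

# Counts read from Table 7.2 (Classifier 1) and Table 7.3 (Classifier 2).
clf1 = binary_metrics(tp=100, fn=400, fp=50, tn=100_000)
clf2 = binary_metrics(tp=50, fn=10, fp=450, tn=100_040)

for name in clf1:
    print(f"{name:12s} {clf1[name]:.3e}   {clf2[name]:.3e}")
```

Calling `binary_metrics` with `beta=0` returns an F score equal to the precision, in line with the limiting case mentioned in the list of metrics.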





Remark 7.1 (Multilabel and Hierarchical Classification) In standard classification
the classes are assumed to be mutually exclusive. For example, a satellite image could
be classified as “cloudy”, “clear”, or “foggy”. In multilabel classification the classes (often 
called labels) do not have to be mutually exclusive. In this case the response is a subset 
Y of some collection of labels {0,...,c — 1}. Equivalently, the response can be viewed as 
a binary vector of length c, where the y-th element is 1 if the response belongs to label y 
and 0 otherwise. Again, consider the satellite image example and add two labels, such as 
“road” and “river” to the previous three labels. Clearly, an image can contain both a road 
and a river. In addition, the image can be clear, cloudy, or foggy. 

In hierarchical classification a hierarchical relation between classes/labels is taken into 
account during the classification process. Usually, the relations are modeled via a tree or a 
directed acyclic graph. A visual comparison between the hierarchical and non-hierarchical 
(flat) classification tasks for satellite image data is presented in Figure 7.1. 


[Diagrams. Left: a tree with root → {rural, urban}, rural → {farm, barn}, urban → {skyscraper}.
Right: a flat scheme with root → {rural, barn, farm, urban, skyscraper}.]


Figure 7.1: Hierarchical (left) and non-hierarchical (right) classification schemes. Barns
and farms are common in rural areas, while skyscrapers are generally located in cities.
While this relation can be clearly observed in the hierarchical model scheme, the connection
is missing in the non-hierarchical design.

In multilabel classification, both the prediction $\widehat{\mathcal{Y}} := g(\mathbf{x})$ and the true response $\mathcal{Y}$ are
subsets of the label set $\{0, \ldots, c-1\}$. A reasonable metric is the so-called exact match ratio,
defined as

$$\text{exact match ratio} = \frac{\sum_{i=1}^{n} \mathbb{1}\{\widehat{\mathcal{Y}}_i = \mathcal{Y}_i\}}{n}.$$

The exact match ratio is rather stringent, as it requires a full match. In order to consider
partial correctness, the following metrics could be used instead.

• The accuracy is defined as the ratio of correctly predicted labels and the total number
of predicted and actual labels. The formula is given by

$$\mathrm{accuracy} = \frac{\sum_{i=1}^{n} |\mathcal{Y}_i \cap \widehat{\mathcal{Y}}_i|}{\sum_{i=1}^{n} |\mathcal{Y}_i \cup \widehat{\mathcal{Y}}_i|}.$$

• The precision is defined as the ratio of correctly predicted labels and the total number
of predicted labels. Specifically,

$$\mathrm{precision} = \frac{\sum_{i=1}^{n} |\mathcal{Y}_i \cap \widehat{\mathcal{Y}}_i|}{\sum_{i=1}^{n} |\widehat{\mathcal{Y}}_i|}. \qquad (7.7)$$

• The recall is defined as the ratio of correctly predicted labels and the total number of
actual labels. Specifically,

$$\mathrm{recall} = \frac{\sum_{i=1}^{n} |\mathcal{Y}_i \cap \widehat{\mathcal{Y}}_i|}{\sum_{i=1}^{n} |\mathcal{Y}_i|}.$$

• The Hamming loss counts the average number of incorrect predictions for all classes,
calculated as

$$\mathrm{Hamming} = \frac{1}{n\,c}\sum_{i=1}^{n}\sum_{y=0}^{c-1} \left(\mathbb{1}\{y \in \widehat{\mathcal{Y}}_i\}\,\mathbb{1}\{y \notin \mathcal{Y}_i\} + \mathbb{1}\{y \notin \widehat{\mathcal{Y}}_i\}\,\mathbb{1}\{y \in \mathcal{Y}_i\}\right).$$
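The multilabel metrics above translate directly into set operations. The following sketch evaluates them on a tiny made-up data set with c = 5 labels; the data, the function name `multilabel_metrics`, and the dictionary keys are assumptions made for illustration only.

```python
def multilabel_metrics(Y_true, Y_pred, c):
    """Multilabel metrics for lists of label sets; c is the number of labels.

    Assumes that at least one label is predicted and at least one is actual,
    so that none of the denominators is zero.
    """
    n = len(Y_true)
    pairs = list(zip(Y_true, Y_pred))
    inter = sum(len(y & yh) for y, yh in pairs)   # correctly predicted labels
    union = sum(len(y | yh) for y, yh in pairs)   # predicted or actual labels
    return {
        "exact_match": sum(y == yh for y, yh in pairs) / n,
        "accuracy": inter / union,
        "precision": inter / sum(len(yh) for _, yh in pairs),
        "recall": inter / sum(len(y) for y, _ in pairs),
        # The symmetric difference counts labels that are in exactly one set.
        "hamming": sum(len(y ^ yh) for y, yh in pairs) / (n * c),
    }

# Hypothetical responses and predictions for n = 3 objects.
Y_true = [{0, 3}, {1}, {2, 3, 4}]
Y_pred = [{0, 3}, {1, 4}, {2, 3}]
metrics = multilabel_metrics(Y_true, Y_pred, c=5)
print(metrics)
```

Only the first object is matched exactly, yet most individual labels are correct, which is why the partial-correctness metrics are much higher than the exact match ratio here.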


7.3 Classification via Bayes’ Rule 


We saw from Theorem 7.1 that the optimal classifier for classes $0, \ldots, c-1$ divides the
feature space into $c$ regions, depending on $f(y\,|\,\mathbf{x})$: the conditional pdf of the response $Y$
given the feature vector $\mathbf{X} = \mathbf{x}$. In particular, if $f(y\,|\,\mathbf{x}) > f(z\,|\,\mathbf{x})$ for all $z \ne y$, the feature
vector $\mathbf{x}$ is classified as $y$. Classifying feature vectors on the basis of their conditional class
probabilities is a natural thing to do, especially in a Bayesian learning context; see
Section 2.9 for an overview of Bayesian terminology and usage. Specifically, the conditional
probability $f(y\,|\,\mathbf{x})$ is interpreted as a posterior probability, of the form

$$f(y\,|\,\mathbf{x}) \propto f(\mathbf{x}\,|\,y)\,f(y), \qquad (7.9)$$


where $f(\mathbf{x}\,|\,y)$ is the likelihood of obtaining feature vector $\mathbf{x}$ from class $y$ and $f(y)$ is the
prior probability¹ of class $y$. By making various modeling assumptions about the prior
(e.g., all classes are a priori equally likely) and the likelihood function, one obtains the
posterior pdf via Bayes’ formula (7.9). A class $\widehat{y}$ is then assigned to a feature vector $\mathbf{x}$
according to the highest posterior probability; that is, we classify according to the Bayes
optimal decision rule:

$$\widehat{y} = \operatorname*{argmax}_{y} f(y\,|\,\mathbf{x}), \qquad (7.10)$$

which is exactly (7.2). Since the discrete density $f(y\,|\,\mathbf{x})$, $y = 0, \ldots, c-1$ is usually not
known, the aim is to approximate it well with a function $g(y\,|\,\mathbf{x})$ from some class of
functions $\mathcal{G}$. Note that in this context, $g(\cdot\,|\,\mathbf{x})$ refers to a discrete density (a probability mass
function) for a given $\mathbf{x}$.

¹Here we have used the Bayesian notation convention of “overloading” the notation $f$.

Suppose a feature vector $\mathbf{x} = [x_1, \ldots, x_p]^\top$ of $p$ features has to be classified into one of
the classes $0, \ldots, c-1$. For example, the classes could be different people and the features
could be various facial measurements, such as the width of the eyes divided by the distance
between the eyes, or the ratio of the nose height and mouth width. In the naive Bayes
method, the class of approximating functions $\mathcal{G}$ is chosen such that
$g(\mathbf{x}\,|\,y) = g(x_1\,|\,y) \cdots g(x_p\,|\,y)$; that is, conditional on the label, all features are independent.
Assuming a uniform prior for $y$, the posterior pdf can thus be written as

$$g(y\,|\,\mathbf{x}) \propto \prod_{j=1}^{p} g(x_j\,|\,y),$$

where the marginal pdfs $g(x_j\,|\,y)$, $j = 1, \ldots, p$ belong to a given class of approximating
functions $\mathcal{G}$. To classify $\mathbf{x}$, simply take the $y$ that maximizes the unnormalized posterior
pdf.

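For instance, suppose that the approximating class $\mathcal{G}$ is such that $(X_j\,|\,y) \sim \mathcal{N}(\mu_{yj}, \sigma^2)$,
$y = 0, \ldots, c-1$, $j = 1, \ldots, p$. The corresponding posterior pdf is then

$$g(y\,|\,\boldsymbol{\theta}, \mathbf{x}) \propto \exp\left(-\frac{1}{2}\sum_{j=1}^{p}\frac{(x_j - \mu_{yj})^2}{\sigma^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_y\|^2}{2\sigma^2}\right),$$

where $\boldsymbol{\mu}_y := [\mu_{y1}, \ldots, \mu_{yp}]^\top$ and $\boldsymbol{\theta} := \{\boldsymbol{\mu}_0, \ldots, \boldsymbol{\mu}_{c-1}, \sigma^2\}$ collects all model parameters. The
probability $g(y\,|\,\boldsymbol{\theta}, \mathbf{x})$ is maximal when $\|\mathbf{x} - \boldsymbol{\mu}_y\|$ is minimal. Thus $\widehat{y} = \operatorname{argmin}_y \|\mathbf{x} - \boldsymbol{\mu}_y\|$ is
the classifier that maximizes the posterior probability. That is, classify $\mathbf{x}$ as $y$ when $\boldsymbol{\mu}_y$ is
closest to $\mathbf{x}$ in Euclidean distance. Of course, the parameters (here, the $\{\boldsymbol{\mu}_y\}$ and $\sigma^2$) are
unknown and have to be estimated from the training data.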
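The equivalence between maximizing the posterior and picking the nearest class mean can be checked with a few lines of code. In the sketch below the class means `mu`, the common variance `sigma2`, and the test point `x` are made-up values, and the function names are introduced only for this illustration.

```python
import numpy as np

# Hypothetical class means (one row per class) and common variance.
mu = np.array([[ 0.0, 0.0],
               [ 2.0, 1.0],
               [-1.0, 3.0]])
sigma2 = 0.5

def log_posterior_unnorm(x, mu, sigma2):
    """Unnormalized log-posterior -||x - mu_y||^2 / (2 sigma^2) for every class y."""
    return -np.sum((x - mu) ** 2, axis=1) / (2 * sigma2)

def nearest_mean(x, mu):
    """Index of the class mean closest to x in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(x - mu, axis=1)))

x = np.array([1.2, 0.9])
y_post = int(np.argmax(log_posterior_unnorm(x, mu, sigma2)))
y_dist = nearest_mean(x, mu)
print(y_post, y_dist)
```

The value of `sigma2` rescales the log-posterior but never changes which class attains the maximum, which is why it drops out of the nearest-mean rule.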

We can extend the above idea to the case where also the variance $\sigma^2$ depends on the
class $y$ and feature $j$, as in the next example.


Example 7.1 (Naive Bayes Classification) Table 7.5 lists the means $\mu$ and standard
deviations $\sigma$ of $p = 3$ normally distributed features, for $c = 4$ different classes. How should
a feature vector $\mathbf{x} = [1.67, 2.00, 4.23]^\top$ be classified? The posterior pdf is

$$g(y\,|\,\boldsymbol{\theta}, \mathbf{x}) \propto (\sigma_{y1}\sigma_{y2}\sigma_{y3})^{-1}\exp\left(-\frac{1}{2}\sum_{j=1}^{3}\frac{(x_j - \mu_{yj})^2}{\sigma_{yj}^2}\right),$$

where $\boldsymbol{\theta} := \{\boldsymbol{\sigma}_j, \boldsymbol{\mu}_j\}_{j=0}^{c-1}$ again collects all model parameters. The (unscaled) values for
$g(y\,|\,\boldsymbol{\theta}, \mathbf{x})$, $y = 0, 1, 2, 3$ are 53.5, 0.24, 8.37, and $3.5 \times 10^{-6}$, respectively. Hence, the feature
vector should be classified as 0. The code follows.

Table 7.5: Feature parameters.

           Feature 1       Feature 2       Feature 3
Class      μ       σ       μ       σ       μ       σ
0          1.6     0.1     2.4     0.5     4.3     0.2
1          1.5     0.2     2.9     0.6     6.1     0.9
2          1.8     0.3     2.5     0.3     4.2     0.3
3          1.1     0.2     3.1     0.7     5.6     0.3
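naiveBayes.py

import numpy as np
x = np.array([1.67,2,4.23]).reshape(1,3)
# Means and standard deviations from Table 7.5 (rows: classes, columns: features)
mu = np.array([1.6, 2.4, 4.3,
               1.5, 2.9, 6.1,
               1.8, 2.5, 4.2,
               1.1, 3.1, 5.6]).reshape(4,3)
sig = np.array([0.1, 0.5, 0.2,
                0.2, 0.6, 0.9,
                0.3, 0.3, 0.3,
                0.2, 0.7, 0.3]).reshape(4,3)
# Unnormalized posterior of class y
g = lambda y: 1/np.prod(sig[y,:]) * np.exp(
      -0.5*np.sum((x-mu[y,:])**2/sig[y,:]**2))
for y in range(0,4):
    print('{:3.2e}'.format(g(y)))

5.35e+01
2.42e-01
8.37e+00
3.53e-06
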
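7.4 Linear and Quadratic Discriminant Analysis

The Bayesian viewpoint for classification of the previous section (not limited to naive
Bayes) leads in a natural way to the well-established technique of discriminant analysis.
We discuss the binary classification case first, with classes 0 and 1.

We consider a class of approximating functions $\mathcal{G}$ such that, conditional on the class
$y \in \{0, 1\}$, the feature vector $\mathbf{X} = [X_1, \ldots, X_p]^\top$ has a $\mathcal{N}(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$ distribution (see (2.33)):

$$g(\mathbf{x}\,|\,\boldsymbol{\theta}, y) = \frac{1}{\sqrt{(2\pi)^p\,|\boldsymbol{\Sigma}_y|}}\;\mathrm{e}^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_y)^\top\boldsymbol{\Sigma}_y^{-1}(\mathbf{x} - \boldsymbol{\mu}_y)}, \quad \mathbf{x} \in \mathbb{R}^p,\ y \in \{0, 1\}, \qquad (7.11)$$

where $\boldsymbol{\theta} = \{\alpha_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=0}^{c-1}$ collects all model parameters, including the probability vector $\boldsymbol{\alpha}$
(that is, $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$) which helps define the prior density: $g(y\,|\,\boldsymbol{\theta}) = \alpha_y$, $y \in \{0, 1\}$.
Then, the posterior density is

$$g(y\,|\,\boldsymbol{\theta}, \mathbf{x}) \propto \alpha_y \times g(\mathbf{x}\,|\,\boldsymbol{\theta}, y),$$

and, according to the Bayes optimal decision rule (7.10), we classify $\mathbf{x}$ to come from class
0 if $\alpha_0\,g(\mathbf{x}\,|\,\boldsymbol{\theta}, 0) > \alpha_1\,g(\mathbf{x}\,|\,\boldsymbol{\theta}, 1)$ or, equivalently (by taking logarithms), if

$$\ln\alpha_0 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_0| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_0)^\top\boldsymbol{\Sigma}_0^{-1}(\mathbf{x} - \boldsymbol{\mu}_0) > \ln\alpha_1 - \frac{1}{2}\ln|\boldsymbol{\Sigma}_1| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}_1^{-1}(\mathbf{x} - \boldsymbol{\mu}_1).$$

The function

$$\delta_y(\mathbf{x}) = \ln\alpha_y - \frac{1}{2}\ln|\boldsymbol{\Sigma}_y| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_y)^\top\boldsymbol{\Sigma}_y^{-1}(\mathbf{x} - \boldsymbol{\mu}_y), \quad \mathbf{x} \in \mathbb{R}^p \qquad (7.12)$$

is called the quadratic discriminant function for class $y = 0, 1$. A point $\mathbf{x}$ is classified to
class $y$ for which $\delta_y(\mathbf{x})$ is largest. The function is quadratic in $\mathbf{x}$ and so the decision boundary
$\{\mathbf{x} \in \mathbb{R}^p : \delta_0(\mathbf{x}) = \delta_1(\mathbf{x})\}$ is quadratic as well. An important simplification arises for the
case where the assumption is made that $\boldsymbol{\Sigma}_0 = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}$. Now, the decision boundary is the
set of $\mathbf{x}$ for which

$$\ln\alpha_0 - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_0)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_0) = \ln\alpha_1 - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1).$$

Expanding the above expression shows that the quadratic term in $\mathbf{x}$ is eliminated, giving a
linear decision boundary in $\mathbf{x}$:

$$\ln\alpha_0 - \frac{1}{2}\boldsymbol{\mu}_0^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0 + \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0 = \ln\alpha_1 - \frac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1.$$

The corresponding linear discriminant function for class $y$ is

$$\delta_y(\mathbf{x}) = \ln\alpha_y - \frac{1}{2}\boldsymbol{\mu}_y^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_y + \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_y, \quad \mathbf{x} \in \mathbb{R}^p. \qquad (7.13)$$
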
Example 7.2 (Linear Discriminant Analysis) Consider the case where $\alpha_0 = \alpha_1 = 1/2$
and

$$\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 0.7 \\ 0.7 & 2 \end{bmatrix}, \quad \boldsymbol{\mu}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \boldsymbol{\mu}_1 = \begin{bmatrix} 2 \\ 4 \end{bmatrix}.$$

The distribution of $\mathbf{X}$ is a mixture of two bivariate normal distributions. Its pdf,

$$\frac{1}{2}\,g(\mathbf{x}\,|\,\boldsymbol{\theta}, y = 0) + \frac{1}{2}\,g(\mathbf{x}\,|\,\boldsymbol{\theta}, y = 1),$$

is depicted in Figure 7.2.


Figure 7.2: A Gaussian mixture density where the two mixture components have the same
covariance matrix.

We used the following Python code to make this figure.


LDAmixture.py

import numpy as np, matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import LightSource

# Class means and the common covariance matrix of Example 7.2
mu0, mu1 = np.array([0,0]), np.array([2,4])
Sigma = np.array([[2,0.7],[0.7, 2]])
x, y = np.mgrid[-4:6:150j,-5:8:150j]
mvn0 = multivariate_normal( mu0, Sigma )
mvn1 = multivariate_normal( mu1, Sigma )

# Evaluate the equal-weight mixture pdf on the grid
xy = np.hstack((x.reshape(-1,1),y.reshape(-1,1)))
z = 0.5*mvn0.pdf(xy).reshape(x.shape) + 0.5*mvn1.pdf(xy).reshape(x.shape)

fig = plt.figure()
# Older Matplotlib versions used fig.gca(projection='3d') here
ax = fig.add_subplot(projection='3d')
ls = LightSource(azdeg=180, altdeg=65)
cols = ls.shade(z, plt.cm.winter)
surf = ax.plot_surface(x, y, z, rstride=1, cstride=1, linewidth=0,
                       antialiased=False, facecolors=cols)
plt.show()


The following Python code, which imports the previous code, draws a contour plot of
the mixture density, simulates 1000 data points from the mixture density, and draws the
decision boundary. To compute and display the linear decision boundary, let
$[a_1, a_2]^\top = 2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ and $b = \boldsymbol{\mu}_0^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1$. Then, the decision boundary can be written as
$a_1 x_1 + a_2 x_2 + b = 0$ or, equivalently, $x_2 = -(a_1 x_1 + b)/a_2$. We see in Figure 7.3 that the
decision boundary nicely separates the two modes of the mixture density.


from LDAmixture import *
from numpy.random import rand
from numpy.linalg import inv

fig = plt.figure()
plt.contourf(x, y, z, cmap=plt.cm.Blues, alpha=0.9, extend='both')
plt.ylim(-5.0,8.0)
plt.xlim(-4.0,6.0)

# Simulate M points from the mixture: pick a component, then draw from it
M = 1000
r = (rand(M,1) < 0.5)
for i in range(0,M):
    if r[i]:
        u = np.random.multivariate_normal(mu0,Sigma,1)
        plt.plot(u[0][0],u[0][1],'.r',alpha = 0.4)
    else:
        u = np.random.multivariate_normal(mu1,Sigma,1)
        plt.plot(u[0][0],u[0][1],'+k',alpha = 0.6)

# Coefficients of the linear decision boundary a1*x1 + a2*x2 + b = 0
a = 2*inv(Sigma) @ (mu1-mu0)
b = ( mu0.reshape(1,2) @ inv(Sigma) @ mu0.reshape(2,1)
    - mu1.reshape(1,2) @ inv(Sigma) @ mu1.reshape(2,1) )
xx = np.linspace(-4,6,100)
yy = (-(a[0]*xx + b)/a[1])[0]
plt.plot(xx,yy,'m')
plt.show()
















Figure 7.3: The linear discriminant boundary lies between the two modes of the mixture 
density and is linear. 


To illustrate the difference between the linear and quadratic case, we specify different 
covariance matrices for the mixture components in the next example. 


Example 7.3 (Quadratic Discriminant Analysis) As in Example 7.2 we consider a
mixture of two Gaussians, but now with different covariance matrices. Figure 7.4 shows
the quadratic decision boundary. The Python code follows.



















Figure 7.4: A quadratic decision boundary. 
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu1 = np.array([0,0])
mu2 = np.array([2,2])
Sigma1 = np.array([[1,0.3],[0.3, 1]])
Sigma2 = np.array([[0.3,0.3],[0.3, 1]])
x, y = np.mgrid[-2:4:150j,-3:5:150j]
mvn1 = multivariate_normal( mu1, Sigma1 )
mvn2 = multivariate_normal( mu2, Sigma2 )

# mixture density on the grid
xy = np.hstack((x.reshape(-1,1),y.reshape(-1,1)))
z = ( 0.5*mvn1.pdf(xy).reshape(x.shape) +
      0.5*mvn2.pdf(xy).reshape(x.shape) )
plt.contour(x,y,z)

# the decision boundary is the zero level set of the difference
z1 = ( 0.5*mvn1.pdf(xy).reshape(x.shape) -
       0.5*mvn2.pdf(xy).reshape(x.shape) )
plt.contour(x,y,z1, levels=[0], linestyles='dashed',
            linewidths=2, colors='m')
plt.show()





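Of course, in practice the true parameter $\theta = \{\alpha_j, \Sigma_j, \mu_j\}_{j=0}^{c-1}$ is not known
and must be estimated from the training data, for example by minimizing the cross-entropy
training loss (4.4) with respect to $\theta$:

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}\big(f(x_i, y_i),\, g(x_i, y_i \mid \theta)\big) = -\frac{1}{n}\sum_{i=1}^{n} \ln g(x_i, y_i \mid \theta),$$

where

$$\ln g(x, y \mid \theta) = \ln \alpha_y - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) - \frac{p}{2}\ln(2\pi).$$

The corresponding estimates of the model parameters (see Exercise 2) are:

$$\begin{aligned}
\widehat{\alpha}_y &= \frac{n_y}{n}, \\
\widehat{\mu}_y &= \frac{1}{n_y}\sum_{i:\, y_i = y} x_i, \\
\widehat{\Sigma}_y &= \frac{1}{n_y}\sum_{i:\, y_i = y} (x_i - \widehat{\mu}_y)(x_i - \widehat{\mu}_y)^\top
\end{aligned} \tag{7.14}$$

for $y = 0, \ldots, c-1$, where $n_y := \sum_{i=1}^{n} \mathbb{1}\{y_i = y\}$. For the case where $\Sigma_y = \Sigma$ for all $y$, we
have $\widehat{\Sigma} = \sum_y \widehat{\alpha}_y \widehat{\Sigma}_y$.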

When c > 2 classes are involved, the classification procedure carries through in exactly
the same way, leading to quadratic and linear discriminant functions (7.12) and (7.13) for
each class. The space $\mathbb{R}^p$ is now partitioned into c regions, determined by the linear or
quadratic boundaries determined by each pair of Gaussians.
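The estimates in (7.14) are simple to compute directly. The following is a minimal sketch, assuming the features are the rows of an array X and the labels are integers 0, ..., c-1; the function name fit_gaussian_classes and the simulated data are illustrative choices, not part of the book's code.

```python
import numpy as np

def fit_gaussian_classes(X, y, c):
    """Estimates (7.14): class proportions, means, and covariance matrices."""
    n = len(y)
    alphas, mus, Sigmas = [], [], []
    for k in range(c):
        Xk = X[y == k]                  # rows with label k
        nk = len(Xk)
        mu = Xk.mean(axis=0)
        D = Xk - mu
        alphas.append(nk / n)
        mus.append(mu)
        Sigmas.append(D.T @ D / nk)     # divides by n_y, not n_y - 1
    alphas, mus, Sigmas = np.array(alphas), np.array(mus), np.array(Sigmas)
    pooled = np.tensordot(alphas, Sigmas, axes=1)   # sum_y alpha_y * Sigma_y
    return alphas, mus, Sigmas, pooled

# simulated data with the parameters of Example 7.2 (500 points per class)
rng = np.random.default_rng(0)
S = [[2, 0.7], [0.7, 2]]
X = np.vstack([rng.multivariate_normal([0, 0], S, 500),
               rng.multivariate_normal([2, 4], S, 500)])
y = np.repeat([0, 1], 500)

alphas, mus, Sigmas, pooled = fit_gaussian_classes(X, y, c=2)
print(alphas, mus, pooled, sep="\n")
```

The pooled matrix is the estimate to use in the linear discriminant case; the per-class matrices are the ones needed for the quadratic case.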
For the linear discriminant case (that is, when $\Sigma_y = \Sigma$ for all $y$), it is convenient to first
"whiten" or sphere the data as follows. Let $B$ be an invertible matrix such that $\Sigma = BB^\top$,
obtained, for example, via the Cholesky method. We linearly transform each data point $x$
to $x' := B^{-1}x$ and each mean $\mu_y$ to $\mu'_y := B^{-1}\mu_y$, $y = 0, \ldots, c-1$. Let the random vector $X$
be distributed according to the mixture pdf

$$g_X(x \mid \theta) := \sum_{y=0}^{c-1} \alpha_y\, \frac{1}{\sqrt{(2\pi)^p\, |\Sigma|}}\; e^{-\frac{1}{2}(x - \mu_y)^\top \Sigma^{-1} (x - \mu_y)}.$$

Then, by the transformation Theorem C.4, the vector $X' = B^{-1}X$ has density

$$\begin{aligned}
g_{X'}(x' \mid \theta) = \frac{g_X(x \mid \theta)}{|B^{-1}|}
&= \sum_{y=0}^{c-1} \frac{\alpha_y}{\sqrt{(2\pi)^p}}\; e^{-\frac{1}{2}(x - \mu_y)^\top (BB^\top)^{-1} (x - \mu_y)} \\
&= \sum_{y=0}^{c-1} \frac{\alpha_y}{\sqrt{(2\pi)^p}}\; e^{-\frac{1}{2}(x' - \mu'_y)^\top (x' - \mu'_y)}
 = \sum_{y=0}^{c-1} \frac{\alpha_y}{\sqrt{(2\pi)^p}}\; e^{-\frac{1}{2}\|x' - \mu'_y\|^2}.
\end{aligned}$$

This is the pdf of a mixture of standard p-dimensional normal distributions. The name
"sphering" derives from the fact that the contours of each mixture component are perfect
spheres. Classification of the transformed data is now particularly easy: classify $x$ as
$\widehat{y} := \operatorname{argmin}_y \{\|x' - \mu'_y\|^2 - 2\ln \alpha_y\}$. Note that this rule only depends on the prior
probabilities and the distance from $x'$ to the transformed means $\{\mu'_y\}$. This procedure can
lead to a significant dimensionality reduction of the data. Namely, the data can be projected
onto the space spanned by the differences between the mean vectors $\{\mu'_y\}$. When there are c
classes, this is a (c − 1)-dimensional space, as opposed to the p-dimensional space of the
original data. We explain the precise ideas via an example.
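Before turning to the example, the sphering step itself can be sketched numerically. The snippet below is only an illustration under stated assumptions: it reuses the parameters of Example 7.2, simulates its own data, and uses the Cholesky factor as the matrix B. The variable names are ours.

```python
import numpy as np

Sigma = np.array([[2.0, 0.7], [0.7, 2.0]])
mus = np.array([[0.0, 0.0], [2.0, 4.0]])     # rows are mu_0 and mu_1
alphas = np.array([0.5, 0.5])

B = np.linalg.cholesky(Sigma)                # lower triangular, Sigma = B B^T
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=2000)
X = mus[labels] + rng.standard_normal((2000, 2)) @ B.T   # rows ~ N(mu_y, Sigma)

Xs = np.linalg.solve(B, X.T).T               # x' = B^{-1} x for every row
mus_s = np.linalg.solve(B, mus.T).T          # mu'_y = B^{-1} mu_y

# squared distances ||x' - mu'_y||^2 for all points (rows) and classes (columns)
d2 = ((Xs[:, None, :] - mus_s[None, :, :])**2).sum(axis=2)
y_hat = np.argmin(d2 - 2*np.log(alphas), axis=1)

print("covariance of sphered class 0:\n", np.cov(Xs[labels == 0].T))
print("accuracy:", np.mean(y_hat == labels))
```

The sample covariance of each sphered class should be close to the identity matrix, and the squared Euclidean distance in the sphered coordinates equals the Mahalanobis distance in the original coordinates.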


Example 7.4 (Classification after Data Reduction) Consider an equal mixture of
three 3-dimensional Gaussian distributions with identical covariance matrices. After
sphering the data, the covariance matrices are all equal to the identity matrix. Suppose the mean
vectors of the sphered data are $\mu_1 = [2, 1, -3]^\top$, $\mu_2 = [1, -4, 0]^\top$, and $\mu_3 = [2, 4, 0]^\top$. The
left panel of Figure 7.5 shows the 3-dimensional (sphered) data from each of the three
classes.

















Figure 7.5: Left: original data. Right: projected data. 


The data are stored in three 1000 x 3 matrices X1, X2, and X3. Here is how the data was 
generated and plotted. 
datared.py

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

n = 1000
mu1 = np.array([2,1,-3])
mu2 = np.array([1,-4,0])
mu3 = np.array([2,4,0])
X1 = randn(n,3) + mu1
X2 = randn(n,3) + mu2
X3 = randn(n,3) + mu3

fig = plt.figure()
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') in older matplotlib
ax.plot(X1[:,0],X1[:,1],X1[:,2],'r.',alpha=0.5,markersize=2)
ax.plot(X2[:,0],X2[:,1],X2[:,2],'b.',alpha=0.5,markersize=2)
ax.plot(X3[:,0],X3[:,1],X3[:,2],'g.',alpha=0.5,markersize=2)
ax.set_xlim3d(-4,6)
ax.set_ylim3d(-5,5)
ax.set_zlim3d(-5,2)
plt.show()


Since we have equal mixtures, we classify each data point $x$ according to the closest
distance to $\mu_1$, $\mu_2$, or $\mu_3$. We can achieve a reduction in the dimensionality of the data by
projecting the data onto the two-dimensional affine space spanned by the $\{\mu_i\}$; that is, all
vectors are of the form

$$\mu_1 + \beta_1(\mu_2 - \mu_1) + \beta_2(\mu_3 - \mu_1), \quad \beta_1, \beta_2 \in \mathbb{R}.$$

In fact, one may just as well project the data onto the subspace spanned by the vectors
$\mu_{21} = \mu_2 - \mu_1$ and $\mu_{31} = \mu_3 - \mu_1$. Let $W = [\mu_{21}, \mu_{31}]$ be the 3 × 2 matrix whose columns
are $\mu_{21}$ and $\mu_{31}$. The orthogonal projection matrix onto the subspace $\mathcal{W}$ spanned by the
columns of $W$ is (see Theorem A.4):

$$P = WW^{+} = W(W^\top W)^{-1} W^\top.$$

Let $UDV^\top$ be the singular value decomposition of $W$. Then $P$ can also be written as

$$P = UD(D^\top D)^{-1} D^\top U^\top.$$

Note that $D$ has dimension 3 × 2, so is not square. The first two columns of $U$, say $u_1$
and $u_2$, form an orthonormal basis of the subspace $\mathcal{W}$. What we want to do is rotate this
subspace to the x−y plane, mapping $u_1$ and $u_2$ to $[1, 0, 0]^\top$ and $[0, 1, 0]^\top$, respectively. This
is achieved via the rotation matrix $U^{-1} = U^\top$, giving the skewed projection matrix

$$R = U^\top P = D(D^\top D)^{-1} D^\top U^\top,$$

whose 3rd row only contains zeros. Applying $R$ to all the data points, and ignoring the
3rd component of the projected points (which is 0), gives the right panel of Figure 7.5.
We see that the projected points are much better separated than the original ones. We have
achieved dimensionality reduction of the data while retaining all the necessary information
required for classification. Here is the rest of the Python code.
dataproj.py

from datared import *
from numpy.linalg import svd, pinv

mu21 = (mu2 - mu1).reshape(3,1)
mu31 = (mu3 - mu1).reshape(3,1)
W = np.hstack((mu21, mu31))
U,_,_ = svd(W) # we only need U
P = W @ pinv(W)  # orthogonal projection onto the column space of W
R = U.T @ P      # projection followed by rotation to the x-y plane

RX1 = (R @ X1.T).T
RX2 = (R @ X2.T).T
RX3 = (R @ X3.T).T

plt.plot(RX1[:,0],RX1[:,1],'b.',alpha=0.5,markersize=2)
plt.plot(RX2[:,0],RX2[:,1],'g.',alpha=0.5,markersize=2)
plt.plot(RX3[:,0],RX3[:,1],'r.',alpha=0.5,markersize=2)
plt.show()

7.5 Logistic Regression and Softmax Classification 
In Example 5.10 we introduced the logistic (logit) regression model as a generalized linear
model where, conditional on a p-dimensional feature vector $x$, the random response $Y$ has
a $\mathsf{Ber}(h(x^\top \beta))$ distribution with $h(u) = 1/(1 + e^{-u})$. The parameter $\beta$ was then learned from
the training data by maximizing the likelihood of the training responses or, equivalently,
by minimizing the supervised version of the cross-entropy training loss (4.4):

$$-\frac{1}{n}\sum_{i=1}^{n} \ln g(y_i \mid \beta, x_i),$$

where $g(y = 1 \mid \beta, x) = 1/(1 + e^{-x^\top \beta})$ and $g(y = 0 \mid \beta, x) = e^{-x^\top \beta}/(1 + e^{-x^\top \beta})$. In particular,
we have

$$\ln \frac{g(y = 1 \mid \beta, x)}{g(y = 0 \mid \beta, x)} = x^\top \beta. \tag{7.15}$$

In other words, the log-odds ratio is a linear function of the feature vector. As a
consequence, the decision boundary $\{x : g(y = 0 \mid \beta, x) = g(y = 1 \mid \beta, x)\}$ is the hyperplane
$x^\top \beta = 0$. Note that $x$ typically includes the constant feature. If the constant feature is
considered separately, that is $x = [1, \widetilde{x}^\top]^\top$, then the boundary is an affine hyperplane in $\widetilde{x}$.

Suppose that training on $\tau = \{(x_i, y_i)\}$ yields the estimate $\widehat{\beta}$ with the corresponding
learner $g_\tau(y = 1 \mid x) = 1/(1 + e^{-x^\top \widehat{\beta}})$. The learner can be used as a pre-classifier from which
we obtain the classifier $\mathbb{1}\{g_\tau(y = 1 \mid x) > 1/2\}$ or, equivalently,

$$\widehat{y} := \operatorname*{argmax}_{j \in \{0, 1\}}\; g_\tau(y = j \mid x),$$

in accordance with the fundamental classification rule (7.2).
The above classification methodology for the logit model can be generalized to the
multi-logit model where the response takes values in the set {0, ..., c − 1}. The key idea is
to replace (7.15) with

$$\ln \frac{g(y = j \mid W, b, x)}{g(y = 0 \mid W, b, x)} = x^\top \beta_j, \quad j = 1, \ldots, c-1, \tag{7.16}$$

where the matrix $W \in \mathbb{R}^{(c-1)\times(p-1)}$ and vector $b \in \mathbb{R}^{c-1}$ reparameterize all $\beta_j \in \mathbb{R}^p$ such that
(recall $x = [1, \widetilde{x}^\top]^\top$):

$$W\widetilde{x} + b = [\beta_1, \ldots, \beta_{c-1}]^\top x.$$

Observe that the random response $Y$ is assumed to have a conditional probability
distribution for which the log-odds ratio with respect to class $j$ and a "reference" class (in this
case 0) is linear. The separating boundaries between two pairs of classes are again affine
hyperplanes.

The model (7.16) completely specifies the distribution of $Y$, namely:

$$g(y \mid W, b, x) = \frac{\exp(z_{y+1})}{\sum_{k=1}^{c} \exp(z_k)}, \quad y = 0, \ldots, c-1,$$

where $z_1$ is an arbitrary constant, say 0, corresponding to the "reference" class $y = 0$, and

$$[z_2, \ldots, z_c]^\top := W\widetilde{x} + b.$$

Note that $g(y \mid W, b, x)$ is the $(y+1)$-st component of $a = \mathrm{softmax}(z)$, where

$$\mathrm{softmax} : z \mapsto \frac{\exp(z)}{\sum_{k} \exp(z_k)}$$

is the softmax function and $z = [z_1, \ldots, z_c]^\top$. Finally, we can write the classifier as

$$\widehat{y} = \operatorname*{argmax}_{j \in \{0, \ldots, c-1\}}\; a_{j+1}.$$

In summary, we have the sequence of mappings transforming the input $x$ into the output $\widehat{y}$:

$$x \mapsto W\widetilde{x} + b \mapsto \mathrm{softmax}(z) \mapsto \operatorname*{argmax}_{j \in \{0, \ldots, c-1\}} a_{j+1} \mapsto \widehat{y}.$$

In Example 9.4 we will revisit the multi-logit model and reinterpret this sequence of
mappings as a neural network. In the context of neural networks, $W$ is called a weight matrix
and $b$ is called a bias vector.

The parameters $W$ and $b$ have to be learned from the training data, which involves
minimization of the supervised version of the cross-entropy training loss (4.4):

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}\big(f(y_i \mid x_i),\, g(y_i \mid W, b, x_i)\big) = -\frac{1}{n}\sum_{i=1}^{n} \ln g(y_i \mid W, b, x_i).$$

Using the softmax function, the cross-entropy loss can be simplified to:

$$\mathrm{Loss}\big(f(y \mid x),\, g(y \mid W, b, x)\big) = -z_{y+1} + \ln \sum_{k=1}^{c} \exp(z_k). \tag{7.17}$$

The discussion on training is postponed until Chapter 9, where we reinterpret the
multi-logit model as a neural net, which can be trained using the limited-memory BFGS method
(Exercise 11). Note that in the binary case (c = 2), where there is only one vector $\beta$ to
be estimated, Example 5.10 already established that minimization of the cross-entropy
training loss is equivalent to likelihood maximization.
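The sequence of mappings from input to predicted class, together with the simplified loss (7.17), can be written in a few lines. This is a minimal sketch rather than a training routine: the parameter values W and b below are hypothetical, and the function names are ours. Note that Python indexes from 0, so the book's component z_{y+1} is z[y] in the code.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def scores(W, b, x_tilde):
    """z = [z_1, ..., z_c] with z_1 = 0 for the reference class y = 0."""
    return np.concatenate(([0.0], W @ x_tilde + b))

def multilogit_predict(W, b, x_tilde):
    a = softmax(scores(W, b, x_tilde))
    return a, int(np.argmax(a))

def cross_entropy(W, b, x_tilde, y):
    """Loss (7.17): -z_{y+1} + ln sum_k exp(z_k), computed stably."""
    z = scores(W, b, x_tilde)
    m = z.max()
    return -z[y] + m + np.log(np.exp(z - m).sum())

# hypothetical parameters: c = 3 classes, p - 1 = 2 non-constant features
W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.1, -0.3])
x_tilde = np.array([0.2, 1.5])

a, y_hat = multilogit_predict(W, b, x_tilde)
print(a, y_hat, cross_entropy(W, b, x_tilde, y_hat))
```

The loss equals minus the logarithm of the probability assigned to the observed class, which is why the two expressions for the cross-entropy training loss coincide.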
7.6 K-Nearest Neighbors Classification 


Let $\tau = \{(x_i, y_i)\}_{i=1}^{n}$ be the training set, with $y_i \in \{0, \ldots, c-1\}$, and let $x$ be a new feature
vector. Define $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ as the feature vectors ordered by closeness to $x$ in some
distance $\mathrm{dist}(x, x_i)$, e.g., the Euclidean distance $\|x - x'\|$. Let
$\tau(x) := \{(x_{(1)}, y_{(1)}), \ldots, (x_{(K)}, y_{(K)})\}$ be the subset of $\tau$ that contains the K feature vectors $x_i$
that are closest to $x$. Then the K-nearest neighbors classification rule classifies $x$ according
to the most frequently occurring class labels in $\tau(x)$. If two or more labels receive the same
number of votes, the feature vector is classified by selecting one of these labels randomly
with equal probability. For the case K = 1 the set $\tau(x)$ contains only one element, say $(x', y')$,
and $x$ is classified as $y'$. This divides the space into n regions

$$R_i = \{x : \mathrm{dist}(x, x_i) \leq \mathrm{dist}(x, x_j),\; j \neq i\}, \quad i = 1, \ldots, n.$$

For a feature space $\mathbb{R}^p$ with the Euclidean distance, this gives a Voronoi tessellation of the
feature space, similar to what was done for vector quantization in Section 4.6.
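The classification rule just described, including the random tie-break, can be sketched as follows. The function knn_classify and the five training points are illustrative assumptions, and the Euclidean distance is used.

```python
import numpy as np

def knn_classify(X_train, y_train, x, K, rng=None):
    """Classify x by majority vote among its K nearest training points."""
    rng = np.random.default_rng() if rng is None else rng
    dist = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to x
    nearest = np.argsort(dist)[:K]                 # indices of the K closest points
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    winners = labels[votes == votes.max()]         # more than one label if tied
    return rng.choice(winners)                     # uniform choice among tied labels

# toy training set: label 0 above the line x2 = x1, label 1 below it
X_train = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 1.0], [1.0, 0.0], [3.0, 2.5]])
y_train = np.array([0, 0, 1, 1, 1])
x_new = np.array([0.4, 1.2])

print(knn_classify(X_train, y_train, x_new, K=1))
print(knn_classify(X_train, y_train, x_new, K=3))
```

Sorting all n distances costs O(n log n) per query; for large training sets a spatial index (for example a k-d tree) is the usual alternative.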


Example 7.5 (Nearest Neighbor Classification) The Python program below simulates
80 random points above and below the line $x_2 = x_1$. Points above the line $x_2 = x_1$ have
label 0 and points below this line have label 1. Figure 7.6 shows the Voronoi tessellation
obtained from the 1-nearest neighbor classification.








Figure 7.6: The 1-nearest neighbor algorithm divides up the space into Voronoi cells. 


nearestnb.py

import numpy as np
from numpy.random import rand,randn
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

np.random.seed(12345)
M = 80
x = randn(M,2)
y = np.zeros(M) # pre-allocate the label array

for i in range(M):
    if rand()<0.5:
        x[i,1], y[i] = x[i,0] + np.abs(randn()), 0   # above the line
    else:
        x[i,1], y[i] = x[i,0] - np.abs(randn()), 1   # below the line

vor = Voronoi(x)
plt_options = {'show_vertices':False, 'show_points':False,
               'line_alpha':0.5}
fig = voronoi_plot_2d(vor, **plt_options)
plt.plot(x[y==0,0], x[y==0,1],'bo',
         x[y==1,0], x[y==1,1],'rs', markersize=3)





7.7 Support Vector Machine 


Suppose we are given the training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$, where each response² $y_i$ takes either
the value −1 or 1, and we wish to construct a classifier taking values in {−1, 1}. As this
merely involves a relabeling of the 0-1 classification problem in Section 7.1, the optimal
classification function for the indicator loss, $\mathbb{1}\{y \neq \widehat{y}\}$, is, by Theorem 7.1, equal to

$$g^*(x) = \begin{cases} \phantom{-}1 & \text{if } \mathbb{P}[Y = 1 \mid X = x] \geq 1/2, \\ -1 & \text{if } \mathbb{P}[Y = 1 \mid X = x] < 1/2. \end{cases}$$

It is not difficult to show, see Exercise 5, that the function $g^*$ can be viewed as the minimizer
of the risk for the hinge loss function, $\mathrm{Loss}(y, \widehat{y}) = (1 - y\widehat{y})_+ := \max\{0, 1 - y\widehat{y}\}$, over all
prediction functions $g$ (not necessarily taking values only in the set {−1, 1}). That is,

$$g^* = \operatorname*{argmin}_{g}\; \mathbb{E}\,(1 - Y g(X))_+. \tag{7.18}$$

Given the training set $\tau$, we can approximate the risk $\ell(g) = \mathbb{E}\,(1 - Y g(X))_+$ with the
training loss

$$\ell_\tau(g) = \frac{1}{n}\sum_{i=1}^{n} (1 - y_i\, g(x_i))_+,$$

and minimize this over a (smaller) class of functions to obtain the optimal prediction
function $g_\tau$. Finally, as the prediction function $g_\tau$ generally is not a classifier by itself (it usually
does not only take values −1 or 1), we take the classifier

$$\mathrm{sign}\; g_\tau(x).$$

²The reason why we use responses −1 and 1 here, instead of 0 and 1, is that the notation becomes easier.
Therefore, a feature vector $x$ is classified according to 1 or −1 depending on whether
$g_\tau(x) \geq 0$ or $< 0$, respectively. The optimal decision boundary is given by the set of $x$ for
which $g_\tau(x) = 0$.
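As a small numerical illustration of the hinge training loss and the resulting sign classifier, consider the sketch below. The affine prediction function g and the four data points are hypothetical; any function returning real values could be substituted.

```python
import numpy as np

def hinge_training_loss(g, X, y):
    """(1/n) * sum_i max(0, 1 - y_i g(x_i)) for labels y_i in {-1, 1}."""
    return np.mean(np.maximum(0.0, 1.0 - y * g(X)))

def classify(g, X):
    """sign g(x), with g(x) = 0 assigned to class 1."""
    return np.where(g(X) >= 0, 1, -1)

# hypothetical affine prediction function g(x) = beta0 + x^T beta
beta0, beta = -1.0, np.array([1.0, 1.0])
g = lambda X: beta0 + X @ beta

X = np.array([[2.0, 1.0], [0.5, 1.0], [0.0, 0.0], [1.0, 0.5]])
y = np.array([1, 1, -1, -1])

print(hinge_training_loss(g, X, y))   # average hinge loss over the four points
print(classify(g, X))                 # predicted labels in {-1, 1}
```

A point contributes zero loss only when it is classified correctly with y g(x) at least 1; correctly classified points with 0 < y g(x) < 1 are still penalized, which is what produces a margin.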

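Similar to the cubic smoothing spline or RKHS setting in (6.19), we can consider
finding the best classifier, given the training data, via the penalized goodness-of-fit
optimization:

$$\min_{g \in \mathcal{H} \oplus \mathcal{H}_0}\; \frac{1}{n}\sum_{i=1}^{n} [1 - y_i\, g(x_i)]_+ + \widetilde{\gamma}\, \|g\|_{\mathcal{H}}^2,$$

for some regularization parameter $\widetilde{\gamma}$. It will be convenient to define $\gamma := 2n\widetilde{\gamma}$ and to solve
the equivalent problem

$$\min_{g \in \mathcal{H} \oplus \mathcal{H}_0}\; \sum_{i=1}^{n} [1 - y_i\, g(x_i)]_+ + \frac{\gamma}{2}\, \|g\|_{\mathcal{H}}^2.$$

We know from the Representer Theorem 6.6 that if $\kappa$ is the reproducing kernel
corresponding to $\mathcal{H}$, then the solution is of the form (assuming that the null space $\mathcal{H}_0$ has a
constant term only):

$$g(x) = \alpha_0 + \sum_{i=1}^{n} \alpha_i\, \kappa(x_i, x). \tag{7.19}$$

Substituting into the minimization expression yields the analogue of (6.21):

$$\min_{\alpha, \alpha_0}\; \sum_{i=1}^{n} [1 - y_i(\alpha_0 + \{K\alpha\}_i)]_+ + \frac{\gamma}{2}\, \alpha^\top K \alpha, \tag{7.20}$$

where $K$ is the Gram matrix. This is a convex optimization problem, as it is the sum of a
convex quadratic and piecewise linear term in $\alpha$. Defining $\lambda_i := \gamma \alpha_i / y_i$, $i = 1, \ldots, n$ and
$\lambda := [\lambda_1, \ldots, \lambda_n]^\top$, we show in Exercise 10 that the optimal $\alpha$ and $\alpha_0$ in (7.20) can be
obtained by solving the "dual" convex optimization problem

$$\begin{aligned}
\max_{\lambda}\;\; & \sum_{i=1}^{n} \lambda_i - \frac{1}{2\gamma}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j\, y_i y_j\, \kappa(x_i, x_j) \\
\text{subject to: }\; & \lambda^\top y = 0, \quad 0 \leq \lambda \leq 1,
\end{aligned} \tag{7.21}$$

and $\alpha_0 = y_j - \sum_{i=1}^{n} \alpha_i\, \kappa(x_i, x_j)$ for any $j$ for which $\lambda_j \in (0, 1)$. In view of (7.19), the optimal
prediction function (pre-classifier) $g_\tau$ is then given by

$$g_\tau(x) = \alpha_0 + \sum_{i=1}^{n} \alpha_i\, \kappa(x_i, x) = \alpha_0 + \frac{1}{\gamma}\sum_{i=1}^{n} y_i \lambda_i\, \kappa(x_i, x). \tag{7.22}$$

To mitigate possible numerical problems in the calculation of $\alpha_0$ it is customary to take
an overall average:

$$\alpha_0 = \frac{1}{|\mathcal{J}|}\sum_{j \in \mathcal{J}} \Big\{ y_j - \sum_{i=1}^{n} \alpha_i\, \kappa(x_i, x_j) \Big\},$$

where $\mathcal{J} := \{j : \lambda_j \in (0, 1)\}$.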
Note that, from (7.22), the optimal pre-classifier $g_\tau(x)$ and the classifier $\mathrm{sign}\, g_\tau(x)$ only
depend on vectors $x_i$ for which $\lambda_i \neq 0$. These vectors are called the support vectors of the
support vector machine. It is also important to note that the quadratic function in (7.21)
depends on the regularization parameter $\gamma$. By defining $\nu_i := \lambda_i/\gamma$, $i = 1, \ldots, n$, we can
rewrite (7.21) as

$$\begin{aligned}
\min_{\nu}\;\; & \frac{1}{2}\sum_{i,j} \nu_i \nu_j\, y_i y_j\, \kappa(x_i, x_j) - \sum_{i=1}^{n} \nu_i \\
\text{subject to: }\; & \sum_{i=1}^{n} \nu_i y_i = 0, \quad 0 \leq \nu_i \leq 1/\gamma =: C, \quad i = 1, \ldots, n.
\end{aligned} \tag{7.23}$$

For perfectly separable data, that is, data for which an affine plane can be drawn to perfectly
separate the two classes, we may take $C = \infty$, as explained below. Otherwise, C needs to
be chosen via cross-validation or a test data set, for example.


Geometric interpretation 


For the linear kernel function $\kappa(x, x') = x^\top x'$, we have

$$g_\tau(x) = \beta_0 + \beta^\top x,$$

with $\beta_0 = \alpha_0$ and $\beta = \gamma^{-1}\sum_{i=1}^{n} \lambda_i y_i x_i = \sum_{i=1}^{n} \alpha_i x_i$, and so the decision boundary is an affine
plane. The situation is illustrated in Figure 7.7. The decision boundary is formed by the
points $x$ such that $g_\tau(x) = 0$. The two sets $\{x : g_\tau(x) = -1\}$ and $\{x : g_\tau(x) = 1\}$ are called
the margins. The distance from the points on a margin to the decision boundary is $1/\|\beta\|$.



















Figure 7.7: Classifying two classes (red and blue) using SVM. 
Based on the "multipliers" $\{\lambda_i\}$, we can divide the training samples $\{(x_i, y_i)\}$ into three
categories (see Exercise 11):

• Points for which $\lambda_i \in (0, 1)$. These are the support vectors on the margins (green
  encircled in the figure) and are correctly classified.

• Points for which $\lambda_i = 1$. These points, which are also support vectors, lie strictly
  inside the margins (points 1, 2, and 3 in the figure). Such points may or may not be
  correctly classified.

• Points for which $\lambda_i = 0$. These are the non-support vectors, which all lie outside the
  margins. Every such point is correctly classified.

If the classes of points $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$ are perfectly separable by some
affine plane, then there will be no points strictly inside the margins, so all support vectors
will lie exactly on the margins. In this case (7.20) reduces to

$$\begin{aligned}
\min_{\beta, \beta_0}\;\; & \|\beta\|^2 \\
\text{subject to: }\; & y_i(\beta_0 + x_i^\top \beta) \geq 1, \quad i = 1, \ldots, n,
\end{aligned} \tag{7.24}$$

using the fact that $\alpha_0 = \beta_0$ and $K\alpha = XX^\top \alpha = X\beta$. We may replace $\min \|\beta\|^2$ in (7.24) with
$\max 1/\|\beta\|$, as this gives the same optimal solution. As $1/\|\beta\|$ is equal to half the margin
width, the latter optimization problem has a simple interpretation: separate the points via
an affine hyperplane such that the margin width is maximized.
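The geometric statements above can be checked numerically on a small separable data set. The sketch below is an illustration under assumptions: the six points are made up, and a large finite C (here 1e6) is used as a stand-in for the hard-margin case. It uses the coef_, intercept_, support_vectors_, and decision_function members of sklearn's SVC.

```python
import numpy as np
from sklearn.svm import SVC

# small separable toy data set (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(C=1e6, kernel='linear')    # large C approximates the hard-margin problem
clf.fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
g_sv = clf.decision_function(clf.support_vectors_)   # close to -1 or 1 on the margins

print("beta =", beta, " beta0 =", beta0)
print("g at the support vectors:", g_sv)
print("half margin width 1/||beta|| =", 1/np.linalg.norm(beta))
```

For these points the closest pair of opposite-class points is separated along the direction [1, 1], so the fitted normal vector should be proportional to it and the pre-classifier should take values near −1 and 1 at the support vectors.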


Example 7.6 (Support Vector Machine) The data in Figure 7.8 was uniformly
generated on the unit disc. Class-1 points (blue dots) have a radius less than 1/2 (y-values 1) and
class-2 points (red crosses) have a radius greater than 1/2 (y-values −1).




Figure 7.8: Separate the two classes. 


Of course it is not possible to separate the two groups of points via a straight line in
$\mathbb{R}^2$. However, it is possible to separate them in $\mathbb{R}^3$ by considering three-dimensional feature
vectors $z = [z_1, z_2, z_3]^\top = [x_1, x_2, x_1^2 + x_2^2]^\top$. For any $x \in \mathbb{R}^2$, the corresponding feature
vector $z$ lies on a quadratic surface. In this space it is possible to separate the $\{z_i\}$ points into
two groups by means of a planar surface, as illustrated in Figure 7.9.

Figure 7.9: In feature space $\mathbb{R}^3$ the points can be separated by a plane.


We wish to find a separating plane in $\mathbb{R}^3$ using the transformed features. The following
Python code uses the SVC function of the sklearn module to solve the quadratic
optimization problem (7.23) (with $C = \infty$). The results are summarized in Table 7.6. The data is
available from the book's GitHub site as svmcirc.csv.


svmquad.py

import numpy as np
from numpy import genfromtxt
from sklearn.svm import SVC

data = genfromtxt('svmcirc.csv', delimiter=',')
x = data[:,[0,1]] #vectors are rows
y = data[:,[2]].reshape(len(x),) #labels

# third feature: squared radius x1^2 + x2^2
tmp = np.sum(np.power(x,2),axis=1).reshape(len(x),1)
z = np.hstack((x,tmp))

# Note: some scikit-learn versions may reject C = np.inf; a very large
# finite value (e.g., 1e10) is then a practical substitute.
clf = SVC(C = np.inf, kernel='linear')
clf.fit(z,y)

print("Support Vectors \n", clf.support_vectors_)
print("Support Vector Labels ",y[clf.support_])
print("Nu",clf.dual_coef_)
print("Bias",clf.intercept_)


Support Vectors
 [[ 0.038758  0.53796   0.29090314]
 [-0.49116  -0.20563   0.28352184]
 [-0.45068  -0.04797   0.20541358]
 [-0.061107 -0.41651   0.17721465]]
Support Vector Labels  [-1. -1.  1.  1.]
Nu [[ -46.49249413 -249.01807328  265.31805855   30.19250886]]
Bias [5.617891]
Table 7.6: Optimal support vector machine parameters for the $\mathbb{R}^3$ data.

          $z^\top$                 $y$     $\alpha = \nu y$
  0.0388    0.5380   0.2909      -1      -46.4925
 -0.4912   -0.2056   0.2835      -1     -249.0181
 -0.4507   -0.0480   0.2054       1      265.3181
 -0.0611   -0.4165   0.1772       1       30.1925





It follows that the normal vector of the plane is

$$\beta = \sum_{i \in \mathcal{S}} \alpha_i z_i = [-0.9128,\; 0.8917,\; -24.2764]^\top,$$

where $\mathcal{S}$ is the set of indices of the support vectors. We see that the plane is almost
perpendicular to the $z_1, z_2$ plane. The bias term $\beta_0$ can also be found from the table above. In
particular, for any $z^\top$ and $y$ in Table 7.6, we have $y - \beta^\top z = \beta_0 = 5.6179$.

To draw the separating boundary in $\mathbb{R}^2$ we need to project the intersection of the
separating plane with the quadratic surface onto the $z_1, z_2$ plane. That is, we need to find all
points $(z_1, z_2)$ such that

$$5.6179 - 0.9128\, z_1 + 0.8917\, z_2 = 24.2764\, (z_1^2 + z_2^2). \tag{7.25}$$

Dividing by 24.2764 and completing the square shows that this is the equation of a circle
with (approximate) center (−0.019, 0.018) and radius 0.48, which is very close to the true
circular boundary between the two groups, with center (0, 0) and radius 0.5. This circle is
drawn in Figure 7.10.
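The quantities in this calculation can be recomputed from the printed support vectors and dual coefficients alone. The sketch below copies those numbers from the output above, so no data file is needed; the variable names are ours, and the completing-the-square step is written out in the comments.

```python
import numpy as np

# support vectors z_i, labels y_i, and alpha_i = nu_i * y_i, copied from the output above
Z = np.array([[ 0.038758,  0.53796, 0.29090314],
              [-0.49116,  -0.20563, 0.28352184],
              [-0.45068,  -0.04797, 0.20541358],
              [-0.061107, -0.41651, 0.17721465]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([-46.49249413, -249.01807328, 265.31805855, 30.19250886])

beta = alpha @ Z                      # normal vector: sum_i alpha_i z_i
beta0 = np.mean(y - Z @ beta)         # bias, averaged over the support vectors

# Boundary: beta0 + b1*z1 + b2*z2 + b3*(z1^2 + z2^2) = 0. Dividing by b3 gives
# z1^2 + z2^2 + (b1/b3) z1 + (b2/b3) z2 + beta0/b3 = 0, a circle with
# center -(b1, b2)/(2 b3) and squared radius |center|^2 - beta0/b3.
center = -beta[:2] / (2*beta[2])
radius = np.sqrt(center @ center - beta0/beta[2])

print("beta =", beta.round(4), " beta0 =", beta0.round(4))
print("sum of alpha =", alpha.sum())
print("center =", center.round(3), " radius =", radius.round(2))
```

The sum of the dual coefficients is (numerically) zero, in agreement with the equality constraint in (7.23).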


Figure 7.10: The circular decision boundary can be viewed equivalently as (a) the
projection onto the $x_1, x_2$ plane of the intersection of the separating plane with the quadratic
surface (both in $\mathbb{R}^3$), or (b) the set of points $x = (x_1, x_2)$ for which $g_\tau(x) = \beta_0 + \beta^\top \phi(x) = 0$.


An equivalent way to derive this circular separating boundary is to consider the feature
map $\phi(x) = [x_1, x_2, x_1^2 + x_2^2]^\top$ on $\mathbb{R}^2$, which defines a reproducing kernel

$$\kappa(x, x') = \phi(x)^\top \phi(x'),$$

on $\mathbb{R}^2$, which in turn gives rise to a (unique) RKHS $\mathcal{H}$. The optimal prediction function
(7.19) is now of the form

$$g_\tau(x) = \alpha_0 + \frac{1}{\gamma}\sum_{i=1}^{n} y_i \lambda_i\, \phi(x_i)^\top \phi(x) = \beta_0 + \beta^\top \phi(x), \tag{7.26}$$

where $\alpha_0 = \beta_0$ and

$$\beta = \frac{1}{\gamma}\sum_{i=1}^{n} y_i \lambda_i\, \phi(x_i).$$

The decision boundary, $\{x : g_\tau(x) = 0\}$, is again a circle in $\mathbb{R}^2$. The following code
determines the fitted model parameters and the decision boundary. Figure 7.10 shows the
optimal decision boundary, which is identical to (7.25). The function mykernel specifies
the custom kernel above.





import numpy as np, matplotlib.pyplot as plt
from numpy import genfromtxt
from sklearn.svm import SVC

def mykernel(U,V):
    # Gram matrix of the feature map [x1, x2, x1^2 + x2^2]
    tmpU = np.sum(np.power(U,2),axis=1).reshape(len(U),1)
    U = np.hstack((U,tmpU))
    tmpV = np.sum(np.power(V,2),axis=1).reshape(len(V),1)
    V = np.hstack((V,tmpV))
    K = U @ V.T
    print(K.shape)
    return K

# read in the data
inp = genfromtxt('svmcirc.csv', delimiter=',')
data = inp[:,[0,1]] #vectors are rows
y = inp[:,[2]].reshape(len(data),) #labels

# Note: some scikit-learn versions may reject C = np.inf; a very large
# finite value (e.g., 1e10) is then a practical substitute.
clf = SVC(C = np.inf, kernel=mykernel, gamma='auto') # custom kernel
# clf = SVC(C = np.inf, kernel="rbf", gamma='scale') # inbuilt

clf.fit(data,y)

print("Support Vectors \n", clf.support_vectors_)
print("Support Vector Labels ",y[clf.support_])
print("Nu ",clf.dual_coef_)
print("Bias ",clf.intercept_)

# plot
d = 0.001
x_min, x_max = -1,1
y_min, y_max = -1,1
xx, yy = np.meshgrid(np.arange(x_min, x_max, d),
                     np.arange(y_min, y_max, d))

plt.plot(data[clf.support_,0],data[clf.support_,1],'go')
plt.plot(data[y==1,0],data[y==1,1],'b.')
plt.plot(data[y==-1,0],data[y==-1,1],'rx')

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, colors="k")
plt.show()





Finally, we illustrate the use of the Gaussian kernel

$$\kappa(x, x') = e^{-c\,\|x - x'\|^2}, \tag{7.27}$$

where c > 0 is some tuning constant. This is an example of a radial basis function kernel,
which are reproducing kernels of the form $\kappa(x, x') = f(\|x - x'\|)$, for some positive
real-valued function $f$. Each feature vector $x$ is now transformed to a function $\kappa_x = \kappa(x, \cdot)$. We
can think of it as the (unnormalized) pdf of a Gaussian distribution centered around $x$, and
$g_\tau$ is a (signed) mixture of these pdfs, plus a constant; that is,

$$g_\tau(x) = \alpha_0 + \sum_{i=1}^{n} \alpha_i\, e^{-c\,\|x_i - x\|^2}.$$

Replacing in Line 2 of the previous code mykernel with 'rbf' produces the SVM
parameters given in Table 7.7. Figure 7.11 shows the decision boundary, which is not
exactly circular, but is close to the true (circular) boundary $\{x : \|x\| = 1/2\}$. There are now
seven support vectors, rather than the four in Figure 7.10.


Table 7.7: Optimal support vector machine parameters for the Gaussian kernel case.

       $x^\top$            $y$    $\alpha$ (×10^9)
  0.0388    0.5380     -1      -0.0635
 -0.4912   -0.2056     -1      -9.4793
  0.5086    0.1576     -1      -0.5240
 -0.4507   -0.0480      1       5.5405
 -0.4374    0.3854     -1      -1.4399
  0.3402   -0.5740     -1      -0.1000
 -0.4098   -0.1763      1       6.0662














Figure 7.11: Left: The decision boundary $\{x : g_\tau(x) = 0\}$ is roughly circular, and separates
the two classes well. There are seven support vectors, indicated by green circles. Right:
The graph of $g_\tau$ is a scaled mixture of Gaussian pdfs plus a constant.

Remark 7.2 (Scaling and Penalty Parameters) When using a radial basis function in
SVC in sklearn, the scaling c in (7.27) can be set via the parameter gamma. Note that large
values of gamma lead to highly peaked predicted functions, and small values lead to highly
smoothed predicted functions. The parameter C in SVC refers to $C = 1/\gamma$ in (7.23).
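The role of gamma can be made concrete by computing a Gaussian Gram matrix by hand and comparing it with sklearn. This is a sketch under the assumption that the Gaussian kernel has the form exp(−c‖x − x'‖²), which is the convention of sklearn's rbf_kernel with gamma = c; the helper gaussian_gram and the random points are ours.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_gram(X, c):
    """Gram matrix with entries exp(-c * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # squared Euclidean distances
    return np.exp(-c * np.maximum(d2, 0.0))         # clip tiny negative round-off

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(5, 2))
c = 0.7

K = gaussian_gram(X, c)
print(np.allclose(K, rbf_kernel(X, gamma=c)))       # same matrix as sklearn's rbf kernel
print(gaussian_gram(X, 50.0).round(3))              # large c: nearly the identity matrix
```

With a large scaling constant, each kernel function is sharply peaked at its own center, so off-diagonal Gram entries are close to zero; with a small constant, all entries approach 1 and the fitted function becomes very smooth.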


7.8 Classification with Scikit-Learn 


In this section we apply several classification methods to a real-world data set, using the 
Python module sklearn (the package name is Scikit-Learn). Specifically, the data is ob- 
tained from UCI’s Breast Cancer Wisconsin data set. This data set, first published and 
analyzed in [118], contains the measurements related to 569 images of 357 benign and 
212 malignant breast masses. The goal is to classify a breast mass as benign or malig- 
nant based on 10 features: Radius, Texture, Perimeter, Area, Smoothness, Compactness, 
Concavity, Concave Points, Symmetry, and Fractal Dimension of each mass. The mean, 
standard error, and “worst” of these attributes were computed for each image, resulting in 
30 features. For instance, feature 1 is Mean Radius, feature 11 is Radius SE, feature 21 is 
Worst Radius. 

The following Python code reads the data, extracts the response vector and model (fea- 
ture) matrix and divides the data into a training and test set. 


skclass1.py

from numpy import genfromtxt
from sklearn.model_selection import train_test_split

url1 = "http://mlr.cs.umass.edu/ml/machine-learning-databases/"
url2 = "breast-cancer-wisconsin/"
name = "wdbc.data"
data = genfromtxt(url1 + url2 + name, delimiter=',', dtype=str)
y = data[:,1] #responses
X = data[:,2:].astype('float') #features as an ndarray matrix

X_train , X_test , y_train , y_test = train_test_split(
        X, y, test_size = 0.4, random_state = 1234)


To visualize the data we create a 3D scatterplot for the features mean radius, mean 
texture, and mean concavity, which correspond to the columns 0, 1, and 6 of the model 
matrix X. Figure 7.12 suggests that the malignant and benign breast masses could be well 
separated using these three features. 





from skclass1 import X, y
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

Bidx = np.where(y == 'B')
Midx = np.where(y == 'M')

# plot features Radius (column 0), Texture (1), Concavity (6)
fig = plt.figure()
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') in older matplotlib
ax.scatter(X[Bidx,0], X[Bidx,1], X[Bidx,6],
           c='r', marker='^', label='Benign')
ax.scatter(X[Midx,0], X[Midx,1], X[Midx,6],
           c='b', marker='o', label='Malignant')
ax.legend()
ax.set_xlabel('Mean Radius')
ax.set_ylabel('Mean Texture')
ax.set_zlabel('Mean Concavity')
plt.show()










Figure 7.12: Scatterplot of three features of the benign and malignant breast masses. 


The following code uses various classifiers to predict the category of breast masses
(benign or malignant). In this case the training set has 341 elements and the test set has 228
elements. For each classifier the percentage of correct predictions (that is, the accuracy) in
the test set is reported. We see that in this case quadratic discriminant analysis gives the
highest accuracy (0.956). Exercise 18 explores the question whether this metric is the most
appropriate for these data.





from skclass1 import X_train, y_train, X_test, y_test
from sklearn.metrics import accuracy_score

import sklearn.discriminant_analysis as DA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

names = ["Logit","NBayes", "LDA", "QDA", "KNN", "SVM"]

classifiers = [LogisticRegression(C=1e5),
               GaussianNB(),
               DA.LinearDiscriminantAnalysis(),
               DA.QuadraticDiscriminantAnalysis(),
               KNeighborsClassifier(n_neighbors=5),
               SVC(kernel='rbf', gamma = 1e-4)]

print('Name Accuracy\n'+14*'-')
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('{:6} {:3.3f}'.format(name, accuracy_score(y_test,y_pred)))


Name Accuracy 





Further Reading 


An excellent source for understanding various pattern recognition techniques is the book
[35] by Duda et al. Theoretical foundations of classification, including the
Vapnik–Chervonenkis dimension and the fundamental theorem of learning, are discussed in
[109, 121, 122]. A popular measure for characterizing the performance of a binary
classifier is the receiver operating characteristic (ROC) curve [38]. The naive Bayes
classification paradigm can be extended to handle explanatory variable dependency via graphical
models such as Bayesian networks and Markov random fields [46, 66, 69]. For a detailed
discussion on Bayesian decision theory, see [8].


Exercises 


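1. Let 0 < w < 1. Show that the solution to the convex optimization problem

$$\begin{aligned}
\min_{p_1, \ldots, p_n}\;\; & \sum_{i=1}^{n} p_i^2 \\
\text{subject to: }\; & \sum_{i=1}^{n-1} p_i = w \;\text{ and }\; \sum_{i=1}^{n} p_i = 1,
\end{aligned} \tag{7.28}$$

is given by $p_i = w/(n-1)$, $i = 1, \ldots, n-1$ and $p_n = 1 - w$.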


2. Derive the formulas (7.14) by minimizing the cross-entropy training loss: 

$$-\frac{1}{n}\sum_{i=1}^{n} \ln g(x_i, y_i \mid \theta),$$

where $g(x, y \mid \theta)$ is such that: 

$$\ln g(x, y \mid \theta) = \ln \alpha_y - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1}(x - \mu_y) - \frac{p}{2}\ln(2\pi).$$


3. Adapt the code in Example 7.2 to plot the estimated decision boundary instead of the 
true one in Figure 7.3. Compare the true and estimated decision boundaries. 


4. Recall from equation (7.16) that the decision boundaries of the multi-logit classifier are 
linear, and that the pre-classifier can be written as a conditional pdf of the form: 

$$g(y \mid W, b, x) = \frac{\exp(z_{y+1})}{\sum_{i=1}^{c}\exp(z_i)}, \quad y \in \{0,\ldots,c-1\},$$

where $x^\top = [1, \tilde{x}^\top]$ and $z = W\tilde{x} + b$. 

(a) Show that the linear discriminant pre-classifier in Section 7.4 can also be written as a 
conditional pdf of the form ($\theta = \{\alpha_y, \Sigma_y, \mu_y\}_{y=0}^{c-1}$): 

$$g(y \mid \theta, x) = \frac{\exp(z_{y+1})}{\sum_{i=1}^{c}\exp(z_i)}, \quad y \in \{0,\ldots,c-1\},$$

where $x^\top = [1, \tilde{x}^\top]$ and $z = W\tilde{x} + b$. Find formulas for the corresponding $b$ and $W$ 
in terms of the linear discriminant parameters $\{\alpha_y, \mu_y, \Sigma_y\}_{y=0}^{c-1}$, where $\Sigma_y = \Sigma$ for all $y$. 


(b) Explain which pre-classifier has smaller approximation error: the linear discriminant 
or multi-logit one? Justify your answer by proving an inequality between the two 
approximation errors. 


5. Consider a binary classification problem where the response $Y$ takes values in $\{-1, 1\}$. 
Show that the optimal prediction function for the hinge loss 
$\mathrm{Loss}(y, \hat{y}) = (1 - y\hat{y})_+ := \max\{0, 1 - y\hat{y}\}$ 
is the same as the optimal prediction function $g^*$ for the indicator loss: 

$$g^*(x) = \begin{cases} 1 & \text{if } \mathbb{P}[Y = 1 \mid X = x] > 1/2, \\ -1 & \text{if } \mathbb{P}[Y = 1 \mid X = x] < 1/2. \end{cases}$$

That is, show that 

$$\mathbb{E}\,(1 - Y h(X))_+ \ge \mathbb{E}\,(1 - Y g^*(X))_+ \qquad (7.29)$$

for all functions $h$. 


6. In Example 4.12, we applied a principal component analysis (PCA) to the iris data, 
but refrained from classifying the flowers based on their feature vectors x. Implement a 
1-nearest neighbor algorithm, using a training set of 50 randomly chosen data pairs (x,y) 
from the iris data set. How many of the remaining 100 flowers are correctly classified? 
Now classify these entries with an off-the-shelf multi-logit classifier, e.g., such as can be 
found in the sklearn and statsmodels packages. 


7. Figure 7.13 displays two groups of data points, given in Table 7.8. The convex hulls 
have also been plotted. It is possible to separate the two classes of points via a straight line. 
In fact, many such lines are possible. SVM gives the best separation, in the sense that the 
gap (margin) between the points is maximal. 














Figure 7.13: Separate the points by a straight line so that the separation between the two 
groups is maximal. 


Table 7.8: Data for Figure 7.13. 








     x1        x2      y  |      x1        x2      y
  2.4524    5.5673    -1  |   0.5819   -1.0156     1
  1.2743    0.8265     1  |   1.2065    3.2984    -1
  0.8773   -0.5478     1  |   2.6830    0.4216     1
  1.4837    3.0464    -1  |  -0.0734    1.3457     1
  0.0628    4.0415    -1  |   0.0787    0.6363     1
 -2.4151   -0.9309     1  |   0.3816    5.2976    -1
  1.8152    3.9202    -1  |   0.3386    0.2882     1
  1.8557    2.7262    -1  |  -0.1493   -0.7095     1
 -0.4239    1.8349     1  |   1.5554    4.9880    -1
  1.9630    0.6942     1  |   3.2031    4.4614    -1


(a) Identify from the figure the three support vectors. 


(b) For a separating boundary (line) given by $\beta_0 + \beta^\top x = 0$, show that the margin width 
is $2/\|\beta\|$. 


(c) Show that the parameters $\beta_0$ and $\beta$ that solve the convex optimization problem (7.24) 
provide the maximal width between the margins. 


(d) Solve (7.24) using a penalty approach; see Section B.4. In particular, minimize the 
penalty function 

$$S(\beta, \beta_0) = \|\beta\|^2 - C \sum_{i=1}^{n} \min\left\{(\beta_0 + \beta^\top x_i)\, y_i - 1,\ 0\right\}$$

for some positive penalty constant $C$. 


(e) Find the solution of the dual optimization problem (7.21) by using sklearn's SVC 
method. Note that, as the two point sets are separable, the constraint $\lambda \le 1$ may 
be removed, and the value of $\gamma$ can be set to 1. 


8. In Example 7.6 we used the feature map $\phi(x) = [x_1, x_2, x_1^2 + x_2^2]^\top$ to classify the points. 
An easier way is to map the points into $\mathbb{R}^1$ via the feature map $\phi(x) = \|x\|$ or any monotone 
function thereof. Translated back into $\mathbb{R}^2$ this yields a circular separating boundary. Find 
the radius and center of this circle, using the fact that here the sorted norms for the two 
groups are ..., 0.4889, 0.5528, .... 


9. Let $Y \in \{0, 1\}$ be a response variable and let $h(x)$ be the regression function 

$$h(x) := \mathbb{E}[Y \mid X = x] = \mathbb{P}[Y = 1 \mid X = x].$$

Recall that the Bayes classifier is $g^*(x) = \mathbf{1}\{h(x) > 1/2\}$. Let $g : \mathbb{R} \to \{0,1\}$ be any 
other classifier function. Below, we denote all probabilities and expectations conditional 
on $X = x$ as $\mathbb{P}_x[\cdot]$ and $\mathbb{E}_x[\cdot]$. 


(a) Show that 

$$\mathbb{P}_x[g(x) \ne Y] = \underbrace{\mathbb{P}_x[g^*(x) \ne Y]}_{\text{irreducible error}} + |2h(x) - 1|\, \mathbf{1}\{g(x) \ne g^*(x)\}.$$

Hence, deduce that for a learner $g_T$ constructed from a training set $T$, we have 

$$\mathbb{E}\big[\mathbb{P}_x[g_T(x) \ne Y \mid T]\big] = \mathbb{P}_x[g^*(x) \ne Y] + |2h(x) - 1|\, \mathbb{P}[g_T(x) \ne g^*(x)],$$

where the first expectation and last probability operations are with respect to $T$. 


(b) Using the previous result, deduce that for the unconditional error (that is, we no longer 
condition on $X = x$), we have 

$$\mathbb{P}[g^*(X) \ne Y] \le \mathbb{P}[g_T(X) \ne Y].$$

(c) Show that, if $g_T := \mathbf{1}\{h_T(x) > 1/2\}$ is a classifier function such that as $n \to \infty$ 

$$h_T(x) \xrightarrow{d} Z \sim \mathcal{N}(\mu(x), \sigma^2(x))$$

for some mean and variance functions $\mu(x)$ and $\sigma^2(x)$, respectively, then 

$$\mathbb{P}_x[g_T(x) \ne g^*(x)] \to \Phi\!\left(\frac{\operatorname{sign}(1 - 2h(x))\,(2\mu(x) - 1)}{2\sigma(x)}\right),$$

where $\Phi$ is the cdf of a standard normal random variable. 





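10. The purpose of this exercise is to derive the dual program (7.21) from the primal 
program (7.20). The starting point is to introduce a vector of auxiliary variables 
$\xi := [\xi_1, \ldots, \xi_n]^\top$ and write the primal program as 

$$\begin{aligned} \min_{\alpha, \alpha_0, \xi} \quad & \sum_{i=1}^{n} \xi_i + \frac{\gamma}{2}\, \alpha^\top K \alpha \\ \text{subject to:} \quad & \xi \ge 0, \\ & y_i(\alpha_0 + \{K\alpha\}_i) \ge 1 - \xi_i, \quad i = 1, \ldots, n. \end{aligned} \qquad (7.30)$$

(a) Apply the Lagrangian optimization theory from Section B.2.2 to obtain the Lagrangian 
function $\mathcal{L}(\{\alpha_0, \alpha, \xi\}, \{\lambda, \mu\})$, where $\mu$ and $\lambda$ are the Lagrange multipliers 
corresponding to the first and second inequality constraints, respectively. 

(b) Show that the Karush-Kuhn-Tucker (see Theorem B.2) conditions for optimizing $\mathcal{L}$ 
are: 

$$\begin{aligned} & \lambda^\top y = 0 \\ & \alpha = y \odot \lambda / \gamma \\ & 0 \le \lambda \le 1 \\ & (1 - \lambda) \odot \xi = 0, \quad \lambda_i\,(y_i\, g(x_i) - 1 + \xi_i) = 0, \quad i = 1, \ldots, n \\ & \xi \ge 0, \quad y_i\, g(x_i) - 1 + \xi_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \qquad (7.31)$$

Here $\odot$ stands for componentwise multiplication; e.g., $y \odot \lambda = [y_1\lambda_1, \ldots, y_n\lambda_n]^\top$, and 
we have abbreviated $\alpha_0 + \{K\alpha\}_i$ to $g(x_i)$, in view of (7.19). [Hint: one of the KKT 
conditions is $\lambda = 1 - \mu$; thus we can eliminate $\mu$.] 


(c) Using the KKT conditions (7.31), reduce the Lagrange dual function 
$\mathcal{L}^*(\lambda) := \min_{\alpha_0, \alpha, \xi} \mathcal{L}(\{\alpha_0, \alpha, \xi\}, \{\lambda, 1 - \lambda\})$ to 

$$\mathcal{L}^*(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2\gamma} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j\, y_i y_j\, \kappa(x_i, x_j). \qquad (7.32)$$

(d) As a consequence of (7.19) and (a)-(c), show that the optimal prediction function $g_\tau$ 
is given by 

$$g_\tau(x) = \alpha_0 + \frac{1}{\gamma} \sum_{i=1}^{n} y_i \lambda_i\, \kappa(x_i, x), \qquad (7.33)$$

where $\lambda$ is the solution to 

$$\max_{\lambda}\ \mathcal{L}^*(\lambda) \quad \text{subject to: } \lambda^\top y = 0, \ 0 \le \lambda \le 1, \qquad (7.34)$$

and $\alpha_0 = y_j - \frac{1}{\gamma}\sum_{i=1}^{n} y_i \lambda_i\, \kappa(x_i, x_j)$ for any $j$ such that $\lambda_j \in (0, 1)$. 
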
11. Consider SVM classification as illustrated in Figure 7.7. The goal of this exercise is to 
classify the training points $\{(x_i, y_i)\}$ based on the value of the multipliers $\{\lambda_i\}$ in Exercise 10. 
Let $\xi_i$ be the auxiliary variable in Exercise 10, $i = 1, \ldots, n$. 

(a) For $\lambda_i \in (0, 1)$ show that $(x_i, y_i)$ lies exactly on the decision border. 

(b) For $\lambda_i = 1$, show that $(x_i, y_i)$ lies strictly inside the margins. 

(c) Show that for $\lambda_i = 0$ the point $(x_i, y_i)$ lies outside the margins and is correctly 
classified. 


12. A well-known data set is the MNIST handwritten digit database, containing many 
thousands of digitized numbers (from 0 to 9), each described by a 28 × 28 matrix of gray 
scales. A similar but much smaller data set is described in [63]. Here, each handwritten 
digit is summarized by an 8 × 8 matrix with integer entries from 0 (white) to 15 (black). 
Figure 7.14 shows the first 50 digitized images. The data set can be accessed with Python 
using the sklearn package as follows. 


from sklearn import datasets 
digits = datasets.load_digits() 


x_digits = digits.data # explanatory variables 
y_digits = digits.target # responses 

Figure 7.14: Classify the digitized images. 


(a) Divide the data into a 75% training set and 25% test set. 


(b) Compare the effectiveness of the K-nearest neighbors and naive Bayes method to 
classify the data. 


(c) Assess which K to use in the K-nearest neighbors classification. 


13. Download the winequality-red.csv data set from UCI's wine-quality website. 
The response here is the wine quality (from 0 to 10) as specified by a wine "expert" 
and the explanatory variables are various characteristics such as acidity and sugar content. 
Use the SVC classifier of sklearn.svm with a linear kernel and penalty parameter 
C = 1 (see Remark 7.2) to fit the data. Use the method cross_val_score from 
sklearn.model_selection to obtain a five-fold cross-validation score as an estimate of 
the probability that the predicted class matches the expert's class. 


14. Consider the credit approval data set crx.data from UCI’s credit approval website. 
The data set is concerned with credit card applications. The last column in the data set 
indicates whether the application is approved (+) or not (—). With the view of preserving 
data privacy, all 15 explanatory variables were anonymized. Note that some explanatory 
variables are continuous and some are categorical. 


(a) Load and prepare the data for analysis with sklearn. First, eliminate data 
rows with missing values. Next, encode categorical explanatory variables using a 
OneHotEncoder object from sklearn.preprocessing to create a model matrix X 
with indicator variables for the categorical variables, as described in Section 5.3.5. 


(b) The model matrix should contain 653 rows and 46 columns. The response variable 
should be a 0/1 variable (reject/approve). We will consider several classification 
algorithms and test their performance (using a zero-one loss) via ten-fold cross-validation. 


i. Write a function which takes 3 parameters: X, y, and a model, and returns the 
ten-fold cross-validation estimate of the expected generalization risk. 

ii. Consider the following sklearn classifiers: KNeighborsClassifier (k = 5), 
LogisticRegression, and MLPClassifier (multilayer perceptron). Use the 
function from (i) to identify the best performing classifier. 


15. Consider a synthetic data set that was generated in the following fashion. The explan- 
atory variable follows a standard normal distribution. The response label is 0 if the explan- 
atory variable is between the 0.95 and 0.05 quantiles of the standard normal distribution, 
and 1, otherwise. The data set was generated using the following code. 


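import numpy as np
import scipy.stats

# generate data
np.random.seed(12345)
N = 100
X = np.random.randn(N)
q = scipy.stats.norm.ppf(0.95)   # 0.95 quantile of the standard normal
y = np.zeros(N)
y[X >= q] = 1                    # label 1 in the upper tail
y[X <= -q] = 1                   # label 1 in the lower tail
X = X.reshape(-1, 1)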





Compare the K-nearest neighbors classifier with K = 5 and the logistic regression classifier. 
Without computation, which classifier is likely to be better for these data? Verify your 
answer by coding both classifiers and printing the corresponding training 0-1 loss. 


16. Consider the digits data set from Exercise 12. In this exercise, we would like to train 
a binary classifier for the identification of digit 8. 


(a) Divide the data such that the first 1000 rows are used as the training set and the rest 
are used as the test set. 

(b) Train the LogisticRegression classifier from the sklearn.linear_model package. 


(c) “Train” a naive classifier that always returns 0. That is, the naive classifier identifies 
each instance as being not 8. 


(d) Compare the zero-one test losses of the logistic regression and the naive classifiers. 


(e) Find the confusion matrix, the precision, and the recall of the logistic regression clas- 
sifier. 


(f) Find the fraction of eights that are correctly detected by the logistic regression clas- 
sifier. 


17. Repeat Exercise 16 with the original MNIST data set. Use the first 60,000 rows as the 
train set and the remaining 10,000 rows as the test set. The original data set can be obtained 
using the following code. 


from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)





18. For the breast cancer data in Section 7.8, investigate and discuss whether accuracy is 
the relevant metric to use or if other metrics discussed in Section 7.2 are more appropriate. 


CHAPTER 8 





DECISION TREES AND ENSEMBLE 
METHODS 





Statistical learning methods based on decision trees have gained tremendous pop- 
ularity due to their simplicity, intuitive representation, and predictive accuracy. This 
chapter gives an introduction to the construction and use of such trees. We also dis- 
cuss two key ensemble methods, namely bootstrap aggregation and boosting, which 
can further improve the efficiency of decision trees and other learning methods. 


8.1 Introduction 


Tree-based methods provide a simple, intuitive, and powerful mechanism for both regres- 
sion and classification. The main idea is to divide a (potentially complicated) feature space 
X into smaller regions and fit a simple prediction function to each region. For example, 
in a regression setting, one could take the mean of the training responses associated with 
the training features that fall in that specific region. In the classification setting, a com- 
monly used prediction function takes the majority vote among the corresponding response 
variables. We start with a simple classification example. 


Example 8.1 (Decision Tree for Classification) The left panel of Figure 8.1 shows a 
training set of 15 two-dimensional points (features) falling into two classes (red and blue). 
How should the new feature vector (black point) be classified? 

Figure 8.1: Left: training data and a new feature. Right: a partition of the feature space. 

It is not possible to linearly separate the training set, but we can partition the feature 
space $\mathcal{X} = \mathbb{R}^2$ into rectangular regions and assign a class (color) to each region, as shown 
in the right panel of Figure 8.1. Points in these regions are classified accordingly as blue 
or red. The partition thus defines a classifier (prediction function) $g$ that assigns to each 
feature vector $x$ a class "red" or "blue". For example, for $x = [-15, 0]^\top$ (solid black point), 
$g(x)$ = "blue", since it belongs to a blue region of the feature space. 

Both the classification procedure and the partitioning of the feature space can be 
conveniently represented by a binary decision tree. This is a tree where each node $v$ 
corresponds to a region (subset) $\mathcal{R}_v$ of the feature space $\mathcal{X}$, with the root node 
corresponding to the feature space itself. 

Each internal node $v$ contains a logical condition that divides $\mathcal{R}_v$ into two disjoint 
subregions. The leaf nodes (the terminal nodes of the tree) are not subdivided, and their 
corresponding regions form a partition of $\mathcal{X}$, as they are disjoint and their union is $\mathcal{X}$. 
Associated with each leaf node $w$ is also a regional prediction function $g^w$ on $\mathcal{R}_w$. 

The partitioning of Figure 8.1 was obtained from the decision tree shown in Figure 8.2. 
As an illustration of the decision procedure, consider again the input 
$x = [x_1, x_2]^\top = [-15, 0]^\top$. The classification process starts from the tree root, which 
contains the condition $x_2 \le 12.0$. As the second component of $x$ is 0, the root condition 
is satisfied and we proceed to the left child, which contains the condition $x_2 \le -20.5$. 
The next step is similar. As $0 > -20.5$, the condition is not satisfied and we proceed to 
the right child. Such an evaluation of logical conditions along the tree path will eventually 
bring us to a leaf node and its associated region. In this case the process terminates in a 
leaf that corresponds to the left blue region in the right-hand panel of Figure 8.1. 

Figure 8.2: The decision tree that corresponds to the partition in Figure 8.1. 


More generally, a binary tree $\mathbb{T}$ will partition the feature space $\mathcal{X}$ into as many regions 
as there are leaf nodes. Denote the set of leaf nodes by $\mathcal{W}$. The overall prediction function 
$g$ that corresponds to the tree can then be written as 

$$g(x) = \sum_{w \in \mathcal{W}} g^w(x)\, \mathbf{1}\{x \in \mathcal{R}_w\}, \qquad (8.1)$$

where $\mathbf{1}$ denotes the indicator function. The representation (8.1) is very general and 
depends on (1) how the regions $\{\mathcal{R}_w\}$ are constructed via the logical conditions in the decision 
tree, as well as (2) how the regional prediction functions of the leaf nodes are defined. 
Simple logical conditions of the form $x_j \le \xi$ split a Euclidean feature space into 
rectangles aligned with the axes. For example, Figure 8.2 partitions the feature space into six 
rectangles: two blue and four red rectangles. 
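
The representation (8.1) can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the three leaf regions, their thresholds, and their labels are hypothetical (they are not the regions of Figure 8.2), and the names `leaves` and `predict` are introduced here for convenience.

```python
import numpy as np

# Hypothetical leaves: each is (region indicator, constant regional prediction).
# The thresholds and labels are illustrative only, not those of Figure 8.2.
leaves = [
    (lambda x: x[1] <= 12.0 and x[0] <= 0.0, "blue"),
    (lambda x: x[1] <= 12.0 and x[0] > 0.0, "red"),
    (lambda x: x[1] > 12.0, "red"),
]

def predict(x):
    """Evaluate g(x) = sum_w g^w(x) 1{x in R_w}; exactly one indicator equals 1."""
    hits = [label for in_region, label in leaves if in_region(x)]
    assert len(hits) == 1  # the leaf regions form a partition of the feature space
    return hits[0]

print(predict(np.array([-15.0, 0.0])))
```

Because the leaf regions are disjoint and cover the feature space, the sum in (8.1) always reduces to a single term, which is why a tree traversal and the indicator-sum form give the same prediction.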

In a classification setting, the regional prediction function $g^w$ corresponding to a leaf 
node $w$ takes values in the set of possible class labels. In most cases, as in Example 8.1, it 
is taken to be constant on the corresponding region $\mathcal{R}_w$. In a regression setting, $g^w$ is 
real-valued and also usually takes only one value. That is, every feature vector in $\mathcal{R}_w$ leads to 
the same predicted value. Of course, different regions will usually have different predicted 
values. 

Constructing a tree with a training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$ amounts to minimizing the 
training loss 

$$\ell_\tau(g) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, g(x_i)) \qquad (8.2)$$

for some loss function; see Chapter 2. With $g$ of the form (8.1), we can write 

$$\ell_\tau(g) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(y_i, g(x_i)) = \frac{1}{n}\sum_{i=1}^{n}\sum_{w \in \mathcal{W}} \mathbf{1}\{x_i \in \mathcal{R}_w\}\, \mathrm{Loss}(y_i, g(x_i)) \qquad (8.3)$$

$$= \sum_{w \in \mathcal{W}} \underbrace{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \in \mathcal{R}_w\}\, \mathrm{Loss}(y_i, g^w(x_i))}_{(*)}, \qquad (8.4)$$

where $(*)$ is the contribution by the regional prediction function $g^w$ to the overall training 
loss. In the case where all $\{x_i\}$ are different, finding a decision tree $\mathbb{T}$ that gives a zero 
squared-error or zero-one training loss is easy (see Exercise 1), but such an "overfitted" tree 
will have poor predictive behavior, expressed in terms of the generalization risk. Instead 
we consider a restricted class of decision trees and aim to minimize the training loss within 
that class. It is common to use a top-down greedy approach, which can only achieve an 
approximate minimization of the training loss. 
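
The decomposition of the training loss into per-leaf contributions in (8.3) and (8.4) can be checked numerically. The following is a minimal sketch under assumptions of our own choosing: a one-dimensional feature on a fixed grid, a hypothetical partition of the real line into three leaf regions (the array `edges`), squared-error loss, and constant regional predictions equal to the mean response in each leaf.

```python
import numpy as np

# Deterministic toy data (illustrative only).
x = np.linspace(-1.0, 1.0, 20)     # one-dimensional features
y = np.sin(3 * x)                  # responses

# Hypothetical partition of the real line into three leaf regions.
edges = np.array([-0.3, 0.4])
leaf = np.digitize(x, edges)       # leaf index (0, 1, or 2) of each x_i

# Constant regional predictions: the mean response in each leaf.
g_leaf = np.array([y[leaf == w].mean() for w in range(3)])

# Overall training loss (8.2) with squared-error loss.
total = np.mean((y - g_leaf[leaf]) ** 2)

# Per-leaf contributions, i.e., the terms marked (*) in (8.4).
contrib = [np.sum((y[leaf == w] - g_leaf[w]) ** 2) / len(y) for w in range(3)]

print(total, sum(contrib))         # the two numbers agree up to rounding
```

The per-leaf view is what makes greedy construction possible: changing the prediction or the shape of one leaf only changes that leaf's term in the sum.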


8.2 Top-Down Construction of Decision Trees 


Let $\tau = \{(x_i, y_i)\}_{i=1}^{n}$ be the training set. The key to constructing a binary decision tree $\mathbb{T}$ 
is to specify a splitting rule for each node $v$, which can be defined as a logical function 
$s : \mathcal{X} \to \{\text{False}, \text{True}\}$ or, equivalently, a binary function $s : \mathcal{X} \to \{0, 1\}$. For example, 
in the decision tree of Figure 8.2 the root node has splitting rule $x \mapsto \mathbf{1}\{x_2 \le 12.0\}$, in 
correspondence with the logical condition $\{x_2 \le 12.0\}$. During the construction of the tree, 
each node $v$ is associated with a specific region $\mathcal{R}_v \subseteq \mathcal{X}$ and therefore also the training 
subset $\{(x, y) \in \tau : x \in \mathcal{R}_v\} \subseteq \tau$. Using a splitting rule $s$, we can divide any subset $\sigma$ of the 
training set $\tau$ into two sets: 

$$\sigma_T := \{(x, y) \in \sigma : s(x) = \text{True}\} \quad \text{and} \quad \sigma_F := \{(x, y) \in \sigma : s(x) = \text{False}\}. \qquad (8.5)$$

Starting from an empty tree and the initial data set $\tau$, a generic decision tree 
construction takes the form of the recursive Algorithm 8.2.1. Here we use the notation 
$\mathbb{T}_v$ for a subtree of $\mathbb{T}$ starting from node $v$. The final tree $\mathbb{T}$ is thus obtained via 
$\mathbb{T}$ = Construct_Subtree($v_0$, $\tau$), where $v_0$ is the root of the tree. 







Algorithm 8.2.1: Construct_Subtree 
Input: A node $v$ and a subset of the training data: $\sigma \subseteq \tau$. 
Output: A (sub) decision tree $\mathbb{T}_v$. 

1  if termination criterion is met then          // v is a leaf node 
2      Train a regional prediction function $g^v$ using the training data $\sigma$. 
3  else                                          // split the node 
4      Find the best splitting rule $s_v$ for node $v$. 
5      Create successors $v_T$ and $v_F$ of $v$. 
6      $\sigma_T \leftarrow \{(x, y) \in \sigma : s_v(x) = \text{True}\}$ 
7      $\sigma_F \leftarrow \{(x, y) \in \sigma : s_v(x) = \text{False}\}$ 
8      $\mathbb{T}_{v_T} \leftarrow$ Construct_Subtree($v_T$, $\sigma_T$)    // left branch 
9      $\mathbb{T}_{v_F} \leftarrow$ Construct_Subtree($v_F$, $\sigma_F$)    // right branch 
10 return $\mathbb{T}_v$ 


The splitting rule $s_v$ divides the region $\mathcal{R}_v$ into two disjoint parts, say $\mathcal{R}_{v_T}$ and $\mathcal{R}_{v_F}$. The 
corresponding prediction functions, $g^T$ and $g^F$, satisfy 

$$g^v(x) = g^T(x)\, \mathbf{1}\{x \in \mathcal{R}_{v_T}\} + g^F(x)\, \mathbf{1}\{x \in \mathcal{R}_{v_F}\}, \quad x \in \mathcal{R}_v.$$

In order to implement the procedure described in Algorithm 8.2.1, we need to address 
the construction of the regional prediction functions $g^v$ at the leaves (Line 2), the 
specification of the splitting rule (Line 4), and the termination criterion (Line 1). These important 
aspects are detailed in the following Sections 8.2.1, 8.2.2, and 8.2.3, respectively. 


8.2.1 Regional Prediction Functions 


In general, there is no restriction on how to choose the prediction function g” for a leaf 
node v = w in Line 2 of Algorithm 8.2.1. In principle we can train any model from the 
data; e.g., via linear regression. However, in practice very simple prediction functions are 
used. Below, we detail a popular choice for classification, as well as one for regression. 


1. In the classification setting with class labels $0, \ldots, c - 1$, the regional prediction 
function $g^w$ for leaf node $w$ is usually chosen to be constant and equal to the most 
common class label of the training data in the associated region $\mathcal{R}_w$ (ties can be 
broken randomly). More precisely, let $n_w$ be the number of feature vectors in region 
$\mathcal{R}_w$ and let 

$$p_z^w = \frac{1}{n_w} \sum_{\{(x, y) \in \tau \,:\, x \in \mathcal{R}_w\}} \mathbf{1}\{y = z\}$$

be the proportion of feature vectors in $\mathcal{R}_w$ that have class label $z = 0, \ldots, c - 1$. The 
regional prediction function for node $w$ is chosen to be the constant 

$$g^w(x) = \operatorname*{argmax}_{z \in \{0, \ldots, c-1\}} p_z^w. \qquad (8.6)$$


2. In the regression setting, $g^w$ is usually chosen as the mean response in the region; 
that is, 

$$g^w(x) = \bar{y}_{\mathcal{R}_w} := \frac{1}{n_w} \sum_{\{(x, y) \in \tau \,:\, x \in \mathcal{R}_w\}} y, \qquad (8.7)$$

where $n_w$ is again the number of feature vectors in $\mathcal{R}_w$. It is not difficult to show that 
$g^w(x) = \bar{y}_{\mathcal{R}_w}$ minimizes the squared-error loss with respect to all constant functions, 
in the region $\mathcal{R}_w$; see Exercise 2. 
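
The two regional prediction functions above can be sketched as follows. This is a minimal illustration, not part of the BasicTree.py program: the function names `majority_label` and `mean_response` are hypothetical, and ties in the majority vote are broken here by taking the smallest label rather than randomly.

```python
import numpy as np

def majority_label(y_region, c):
    """Constant classifier (8.6): the most common label among 0, ..., c-1."""
    proportions = np.bincount(y_region, minlength=c) / len(y_region)
    return int(np.argmax(proportions))   # ties go to the smallest label here

def mean_response(y_region):
    """Constant regression predictor (8.7): the average response in the region."""
    return float(np.mean(y_region))

labels = np.array([0, 2, 2, 1, 2])       # labels of the training points in a leaf
responses = np.array([1.0, 2.0, 6.0])    # responses of the training points in a leaf

print(majority_label(labels, c=3))       # label 2 has proportion 3/5
print(mean_response(responses))          # (1 + 2 + 6) / 3
```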


8.2.2 Splitting Rules 


In Line 4 in Algorithm 8.2.1, we divide region $\mathcal{R}_v$ into two sets, using a splitting rule 
(function) $s_v$. Consequently, the data set $\sigma$ associated with node $v$ (that is, the subset of the 
original data set $\tau$ whose feature vectors lie in $\mathcal{R}_v$) is also split, into $\sigma_T$ and $\sigma_F$. What is 
the benefit of such a split in terms of a reduction in the training loss? If $v$ were set to a leaf 
node, its contribution to the training loss would be (see (8.4)): 

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{(x_i, y_i) \in \sigma\}\, \mathrm{Loss}(y_i, g^v(x_i)). \qquad (8.8)$$

If $v$ were to be split instead, its contribution to the overall training loss would be: 

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{(x_i, y_i) \in \sigma_T\}\, \mathrm{Loss}(y_i, g^T(x_i)) + \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{(x_i, y_i) \in \sigma_F\}\, \mathrm{Loss}(y_i, g^F(x_i)), \qquad (8.9)$$

where $g^T$ and $g^F$ are the prediction functions belonging to the child nodes $v_T$ and $v_F$. A 
greedy heuristic is to pretend that the tree construction algorithm immediately terminates 
after the split, in which case $v_T$ and $v_F$ are leaf nodes, and $g^T$ and $g^F$ are readily evaluated, 
e.g., as in Section 8.2.1. Note that for any splitting rule the contribution (8.8) is always 
greater than or equal to (8.9). It therefore makes sense to choose the splitting rule such that 
(8.9) is minimized. Moreover, the termination criterion may involve comparing (8.9) with 
(8.8). If their difference is too small it may not be worth further splitting the feature space. 

As an example, suppose the feature space is $\mathcal{X} = \mathbb{R}^p$ and we consider splitting rules of 
the form 

$$s(x) = \mathbf{1}\{x_j \le \xi\}, \qquad (8.10)$$

for some $1 \le j \le p$ and $\xi \in \mathbb{R}$, where we identify 0 with False and 1 with True. Due to their 
computational and interpretative simplicity, such binary splitting rules are implemented in 
many software packages and are considered to be the de facto standard. As we have seen, 
these rules divide up the feature space into rectangles, as in Figure 8.1. It is natural to ask 
how $j$ and $\xi$ should be chosen so as to minimize (8.9). For a regression problem, using a 
squared-error loss and a constant regional prediction function as in (8.7), the sum (8.9) is 
given by 

$$\frac{1}{n} \sum_{(x, y) \in \sigma \,:\, x_j \le \xi} (y - \bar{y}_T)^2 + \frac{1}{n} \sum_{(x, y) \in \sigma \,:\, x_j > \xi} (y - \bar{y}_F)^2, \qquad (8.11)$$

where $\bar{y}_T$ and $\bar{y}_F$ are the average responses for the $\sigma_T$ and $\sigma_F$ data, respectively. Let $\{x_{j,k}\}_{k=1}^{m}$ 
denote the possible values of $x_j$, $j = 1, \ldots, p$, within the training subset $\sigma$ (with $m \le n$ 
elements). Note that, for a fixed $j$, (8.11) is a piecewise constant function of $\xi$, and that its 
minimal value is attained at some value $x_{j,k}$. As a consequence, to minimize (8.11) over 
all $j$ and $\xi$, it suffices to evaluate (8.11) for each of the $m \times p$ values $x_{j,k}$ and then take the 
minimizing pair $(j, x_{j,k})$. 


For a classification problem, using the indicator loss and a constant regional prediction 
function as in (8.6), the aim is to choose a splitting rule that minimizes 

$$\frac{1}{n} \sum_{(x, y) \in \sigma_T} \mathbf{1}\{y \ne y_T^*\} + \frac{1}{n} \sum_{(x, y) \in \sigma_F} \mathbf{1}\{y \ne y_F^*\}, \qquad (8.12)$$

where $y_T^* = g^T(x)$ is the most prevalent class (majority vote) in the data set $\sigma_T$ and $y_F^*$ is the 
most prevalent class in $\sigma_F$. If the feature space is $\mathcal{X} = \mathbb{R}^p$ and the splitting rules are of the 
form (8.10), then the optimal splitting rule can be obtained in the same way as described 
above for the regression case; the only difference is that (8.11) is replaced with (8.12). 
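
As an illustration of this search in the classification case, the sketch below minimizes (8.12) over splitting rules of the form (8.10) by exhaustively trying every observed feature value as a threshold. It is a minimal example under our own assumptions: the helper names `misclassified` and `best_split` are hypothetical, labels are assumed to be non-negative integers, and the four data points are made up.

```python
import numpy as np

def misclassified(y):
    """Number of labels in y that differ from the majority label."""
    if len(y) == 0:
        return 0
    return len(y) - np.bincount(y).max()

def best_split(X, y):
    """Minimize (8.12) over rules 1{x_j <= xi}, with xi ranging over observed values."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for xi in np.unique(X[:, j]):
            mask = X[:, j] <= xi          # True branch of the splitting rule
            loss = (misclassified(y[mask]) + misclassified(y[~mask])) / n
            if loss < best[2]:
                best = (j, xi, loss)
    return best

# Made-up data: the first feature separates the two classes at threshold 2.0.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
j, xi, loss = best_split(X, y)
print(j, xi, loss)
```

The exhaustive loop evaluates the criterion at most m × p times per node, which is adequate for small examples; practical implementations typically sort each feature once and update the class counts incrementally.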

We can view the minimization of (8.12) as minimizing a weighted average of "impurities" 
of nodes $\sigma_T$ and $\sigma_F$. Namely, for an arbitrary training subset $\sigma \subseteq \tau$, if $y^*$ is the most 
prevalent label, then 

$$\frac{1}{|\sigma|} \sum_{(x, y) \in \sigma} \mathbf{1}\{y \ne y^*\} = 1 - \frac{1}{|\sigma|} \sum_{(x, y) \in \sigma} \mathbf{1}\{y = y^*\} = 1 - p_{y^*} = 1 - \max_{z \in \{0, \ldots, c-1\}} p_z,$$

where $p_z$ is the proportion of data points in $\sigma$ that have class label $z$, $z = 0, \ldots, c - 1$. The 
quantity 

$$1 - \max_{z \in \{0, \ldots, c-1\}} p_z$$

measures the diversity of the labels in $\sigma$ and is called the misclassification impurity. 
Consequently, (8.12) is the weighted sum of the misclassification impurities of $\sigma_T$ and $\sigma_F$, with 
weights $|\sigma_T|/n$ and $|\sigma_F|/n$, respectively. Note that the misclassification impurity only 
depends on the label proportions rather than on the individual responses. Instead of using 
the misclassification impurity to decide if and how to split a data set $\sigma$, we can use other 
impurity measures that only depend on the label proportions. Two popular choices are the 
entropy impurity: 

$$-\sum_{z=0}^{c-1} p_z \log_2(p_z)$$


and the Gini impurity: 

$$1 - \sum_{z=0}^{c-1} p_z^2.$$

All of these impurities are maximal when the label proportions are equal to $1/c$. Typical 
shapes of the above impurity measures are illustrated in Figure 8.3 for the two-label case, 
with class probabilities $p$ and $1 - p$. We see here the similarity of the different impurity 
measures. Note that impurities can be arbitrarily scaled, and so using $\ln(p_z) = \log_2(p_z) \ln(2)$ 
instead of $\log_2(p_z)$ above gives an equivalent entropy impurity. 
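
To make the comparison concrete, the three impurity measures can be evaluated for the two-label case. This is a sketch under assumptions: the function names are hypothetical, the entropy uses the convention 0 log 0 = 0, and the Gini impurity is taken in its common unscaled form 1 − Σ p_z² (any positive rescaling gives an equivalent measure).

```python
import numpy as np

def misclassification_impurity(p):
    """1 - max_z p_z for a vector p of label proportions."""
    return 1.0 - np.max(p)

def entropy_impurity(p):
    """-sum_z p_z log2(p_z), using the convention 0 * log2(0) = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini_impurity(p):
    """1 - sum_z p_z^2 (unscaled form; the scaling is an assumption)."""
    return 1.0 - np.sum(p ** 2)

# Two-label case with proportions (p, 1 - p) on a grid of p values.
grid = np.linspace(0.01, 0.99, 99)
for f in (misclassification_impurity, entropy_impurity, gini_impurity):
    values = [f(np.array([p, 1 - p])) for p in grid]
    print(f.__name__, "is maximal near p =", round(grid[int(np.argmax(values))], 2))
```

All three functions vanish for a pure node (one proportion equal to 1) and peak at equal proportions, which is the similarity the text refers to.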


8.2.3 Termination Criterion 


When building a tree, one can define various types of termination conditions. For example, 
we might stop when the number of data points in the tree node (the size of the input set $\sigma$ 
in Algorithm 8.2.1) is less than or equal to some predefined number. Or we might choose 
the maximal depth of the tree in advance. Another possibility is to stop when there is no 
significant advantage, in terms of training loss, to split regions. Ultimately, the quality of a 
tree is determined by its predictive performance (generalization risk) and the termination 
condition should aim to strike a balance between minimizing the approximation error and 
minimizing the statistical error, as discussed in Section 2.4. 

Figure 8.3: Entropy, Gini, and misclassification impurities for binary classification, with 
class frequencies $p_1 = p$ and $p_2 = 1 - p$. The entropy impurity was normalized (divided by 
2), to ensure that all impurity measures attain the same maximum value of 1/2 at $p = 1/2$. 


Example 8.2 (Fixed Tree Depth) To illustrate how the tree depth impacts the 
generalization risk, consider Figure 8.4, which shows the typical behavior of the cross-validation 
loss as a function of the tree depth. Recall that the cross-validation loss is an estimate of the 
expected generalization risk. Complicated (deep) trees tend to overfit the training data by 
producing many divisions of the feature space. As we have seen, this overfitting problem is 
typical of all learning methods; see Chapter 2 and in particular Example 2.1. To conclude, 
increasing the maximal depth does not necessarily result in better performance. 

Figure 8.4: The ten-fold cross-validation loss as a function of the maximal tree depth for a 
classification problem. The optimal maximal tree depth is here 6. 

To create Figure 8.4 we used¹ the Python method make_blobs from the sklearn 
module to produce a training set of size $n = 5000$ with ten-dimensional feature vectors 
(thus, $p = 10$ and $\mathcal{X} = \mathbb{R}^{10}$), each of which is classified into one of $c = 3$ classes. The full 
code is given below. 


TreeDepthCV.py 


import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss
import matplotlib.pyplot as plt

def ZeroOneScore(clf, X, y):
    # scoring callable: zero-one loss of the fitted classifier on (X, y)
    y_pred = clf.predict(X)
    return zero_one_loss(y, y_pred)

# Construct the training set
X, y = make_blobs(n_samples=5000, n_features=10, centers=3,
                  random_state=10, cluster_std=10)

# construct a decision tree classifier
clf = DecisionTreeClassifier(random_state=0)

# Cross-validation loss as a function of tree depth (1 to 29)
xdepthlist = []
cvlist = []
tree_depth = range(1, 30)
for d in tree_depth:
    xdepthlist.append(d)
    clf.max_depth = d
    cv = np.mean(cross_val_score(clf, X, y, cv=10,
                                 scoring=ZeroOneScore))
    cvlist.append(cv)

plt.xlabel('tree depth', fontsize=18, color='black')
plt.ylabel('loss', fontsize=18, color='black')
plt.plot(xdepthlist, cvlist, '-*', linewidth=0.5)





The code above relies heavily on sklearn and hides the implementation details. To 
show how decision trees are actually constructed using the previous theory, we proceed 


with a very basic implementation. 


8.2.4 Basic Implementation 


In this section we implement a regression tree, step by step. To run the program, amalgamate 
the code snippets below into one file, in the order presented. First, we import various 
packages and define a function to generate the training and test data. 





¹The data used for Figure 8.1 was produced in a similar way. 

BasicTree.py 


import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def makedata():
    n_points = 500  # number of samples
    X, y = make_friedman1(n_samples=n_points, n_features=5,
                          noise=1.0, random_state=100)
    return train_test_split(X, y, test_size=0.5, random_state=3)





The “main” method calls the makedata method, uses the training data to build a regres- 
sion tree, and then predicts the responses of the test set and reports the mean squared-error 
loss. 


def main():
    X_train, X_test, y_train, y_test = makedata()
    maxdepth = 10  # maximum tree depth
    # Create tree root at depth 0
    treeRoot = TNode(0, X_train, y_train)

    # Build the regression tree with maximal depth equal to maxdepth
    Construct_Subtree(treeRoot, maxdepth)

    # Predict
    y_hat = np.zeros(len(X_test))
    for i in range(len(X_test)):
        y_hat[i] = Predict(X_test[i], treeRoot)

    MSE = np.mean(np.power(y_hat - y_test, 2))
    print("Basic tree: tree loss = ", MSE)





The next step is to specify a tree node as a Python class. Each node has a number of
attributes, including the features and the response data (X and y) and the depth at which
the node is placed in the tree. The root node has depth 0. Each node $w$ can calculate its
contribution to the squared-error training loss
$\sum_{i=1}^n \mathbb{1}\{x_i \in \mathcal{R}_w\}(y_i - g^w(x_i))^2$. Note that we
have omitted the constant $1/n$ term when training the tree, which simply scales the loss
(8.2).





class TNode:
    def __init__(self, depth, X, y):
        self.depth = depth
        self.X = X   # matrix of features
        self.y = y   # vector of response variables
        # initialize optimal split parameters
        self.j = None
        self.xi = None
        # initialize children to be None
        self.left = None
        self.right = None
        # initialize the regional predictor
        self.g = None

    def CalculateLoss(self):
        if (len(self.y) == 0):
            return 0

        return np.sum(np.power(self.y - self.y.mean(), 2))





The function below implements the training (tree-building) Algorithm 8.2.1. 


def Construct_Subtree(node, max_depth):
    if (node.depth == max_depth or len(node.y) == 1):
        node.g = node.y.mean()
    else:
        j, xi = CalculateOptimalSplit(node)
        node.j = j
        node.xi = xi
        Xt, yt, Xf, yf = DataSplit(node.X, node.y, j, xi)

        if (len(yt) > 0):
            node.left = TNode(node.depth+1, Xt, yt)
            Construct_Subtree(node.left, max_depth)

        if (len(yf) > 0):
            node.right = TNode(node.depth+1, Xf, yf)
            Construct_Subtree(node.right, max_depth)

    return node





This requires an implementation of the CalculateOptimalSplit function. To start,
we implement a function DataSplit that splits the data according to $s(x) = \mathbb{1}\{x_j \leq \xi\}$.


def DataSplit(X, y, j, xi):
    ids = X[:, j] <= xi   # True where the splitting rule holds
    Xt = X[ids == True, :]
    Xf = X[ids == False, :]
    yt = y[ids == True]
    yf = y[ids == False]
    return Xt, yt, Xf, yf





The CalculateOptimalSplit method runs through the possible splitting thresholds
$\xi$ from the set $\{x_{j,k}\}$ and finds the optimal split.





def CalculateOptimalSplit(node):
    X = node.X
    y = node.y
    best_var = 0
    best_xi = X[0, best_var]
    best_split_val = node.CalculateLoss()

    m, n = X.shape

    # try every feature j and every observed value X[i,j] as a threshold
    for j in range(0, n):
        for i in range(0, m):
            xi = X[i, j]
            Xt, yt, Xf, yf = DataSplit(X, y, j, xi)
            tmpt = TNode(0, Xt, yt)
            tmpf = TNode(0, Xf, yf)
            loss_t = tmpt.CalculateLoss()
            loss_f = tmpf.CalculateLoss()
            curr_val = loss_t + loss_f
            if (curr_val < best_split_val):
                best_split_val = curr_val
                best_var = j
                best_xi = xi
    return best_var, best_xi





Finally, we implement the recursive method for prediction. 


def Predict(X, node):
    if (node.right is None and node.left is not None):
        return Predict(X, node.left)

    if (node.right is not None and node.left is None):
        return Predict(X, node.right)

    if (node.right is None and node.left is None):
        return node.g
    else:
        if (X[node.j] <= node.xi):
            return Predict(X, node.left)
        else:
            return Predict(X, node.right)


Running the main function defined above gives a similar² result to what one would
achieve with the sklearn package, using the DecisionTreeRegressor class.


main() # run the main program 


# compare with sklearn 
from sklearn.tree import DecisionTreeRegressor 


X_train, X_test, y_train, y_test = makedata() # use the same data 
regTree = DecisionTreeRegressor(max_depth = 10, random_state=0) 


regTree.fit(X_train, y_train)
y_hat = regTree.predict(X_test) 
MSE2 = np.mean(np.power(y_hat - y_test,2)) 


print ("DecisionTreeRegressor: tree loss = ", MSE2) 


Basic tree: tree loss = 9.067077996170276 
DecisionTreeRegressor: tree loss = 10.197991295531748 





² After establishing a best split $\xi = x_{j,k}$, sklearn assigns the corresponding feature vector randomly to
one of the two child nodes, rather than to the True child.
8.3 Additional Considerations


8.3.1 Binary Versus Non-Binary Trees 


While it is possible to split a tree node into more than two groups (multiway splits), it 
generally produces inferior results compared to the simple binary split. The major reason 
is that multiway splits can lead to too many nodes near the tree root that have only a 
few data points, thus leaving insufficient data for later splits. As multiway splits can be 
represented as several binary splits, the latter is preferred [55]. 


8.3.2 Data Preprocessing 


Sometimes, it can be beneficial to preprocess the data prior to the tree construction. For
example, PCA can be used with a view to identifying the most important dimensions, which
in turn will lead to simpler and possibly more informative splitting rules in the internal
nodes.


8.3.3 Alternative Splitting Rules 


We restricted our attention to splitting rules of the type $s(x) = \mathbb{1}\{x_j \leq \xi\}$, where
$j \in \{1,\ldots,p\}$ and $\xi \in \mathbb{R}$. These types of rules may not always result in a simple partition
of the feature space, as illustrated by the binary data in Figure 8.5. In this case, the feature
space could have been partitioned into just two regions, separated by a straight line.












































Figure 8.5: The two groups of points can here be separated by a straight line. Instead, the 
classification tree divides up the space into many rectangles, leading to an unnecessarily 
complicated classification procedure. 


In this case many classification methods discussed in Chapter 7, such as linear discrim- 
inant analysis (Section 7.4), will work very well, whereas the classification tree is rather 
elaborate, dividing the feature set into too many regions. An obvious remedy is to use 
splitting rules of the form 


$$s(x) = \mathbb{1}\{a^\top x \leq \xi\}.$$
In some cases, such as the one just discussed, it may be useful to use a splitting rule
that involves several variables, as opposed to a single one. The decision regarding the split
type clearly depends on the problem domain. For example, for logical (binary) variables
our domain knowledge may indicate that a different behavior is expected when both $x_i$ and
$x_j$ ($i \neq j$) are True. In this case, we will naturally introduce a decision rule of the form:


$$s(x) = \mathbb{1}\{x_i = \text{True and } x_j = \text{True}\}.$$


8.3.4 Categorical Variables 


When an explanatory variable is categorical with labels (levels), say $\{1,\ldots,k\}$, the
splitting rule is generally defined via a partition of the label set $\{1,\ldots,k\}$ into two subsets.
Specifically, let $L$ and $R$ be a partition of $\{1,\ldots,k\}$. Then, the splitting rule is defined via


$$s(x) = \mathbb{1}\{x_j \in L\}.$$


For the general supervised learning case, finding the optimal partition in the sense of
minimal loss requires one to consider $2^k$ subsets of $\{1,\ldots,k\}$. Consequently, finding a good
splitting rule for categorical variables can be challenging when the number of labels $k$ is
large.
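
To get a feel for this growth, the following minimal sketch enumerates the candidate splits of a label set. It is an illustration only; the helper name `binary_partitions` is ours. Once the trivial split and mirror images $(L, R)$ versus $(R, L)$ are discarded, $2^{k-1} - 1$ distinct splits remain out of the $2^k$ subsets, which is still exponential in $k$.

```python
from itertools import combinations

def binary_partitions(labels):
    """Yield each split of `labels` into two non-empty sets (L, R) exactly once."""
    labels = list(labels)
    first, rest = labels[0], labels[1:]
    # Keep the first label in L so that (L, R) and (R, L) are not both counted
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            L = {first, *combo}
            R = set(labels) - L
            if R:                      # skip the trivial split with R empty
                yield L, R

for k in [3, 5, 10]:
    count = sum(1 for _ in binary_partitions(range(1, k + 1)))
    print(k, count, 2**(k - 1) - 1)
```

For ordered or numerical features only the observed thresholds need to be scanned, which is why the basic implementation of Section 8.2.4 remains cheap by comparison.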


8.3.5 Missing Values 


Missing data is present in many real-life problems. Generally, when working with incom- 
plete feature vectors, where one or more values are missing, it is typical to either com- 
pletely delete the feature vector from the data (which may distort the data) or to impute 
(guess) its missing values from the available data; see e.g., [120]. Tree methods, however, 
allow an elegant approach for handling missing data. Specifically, in the general case, the 
missing data problem can be handled via surrogate splitting rules [20]. 


When dealing with categorical (factor) features, we can introduce an additional
category “missing” for the absent data.





The main idea of surrogate rules is as follows. First, we construct a decision (regression
or a classification) tree via Algorithm 8.2.1. During this construction process, the solution
of the optimization problem (8.9) is calculated only over the observations that are not
missing a particular variable. Suppose that a tree node $v$ has a splitting rule
$s^*(x) = \mathbb{1}\{x_{j^*} \leq \xi^*\}$ for some $1 \leq j^* \leq p$ and threshold $\xi^*$.

For the node $v$ we can introduce a set of alternative splitting rules that resemble the
original splitting rule, sometimes called the primary splitting rule, using different variables
and thresholds. Namely, we look for a binary splitting rule $s(x \mid j, \xi)$, $j \neq j^*$, such that the
data split introduced by $s$ will be similar to the original data split from $s^*$. The similarity is
generally measured via a binary misclassification loss, where the true classes of observations
are determined by the primary splitting rule and the surrogate splitting rules serve as
classifiers. Consider, for example, the data in Table 8.1 and suppose that the primary splitting
rule at node $v$ is $\mathbb{1}\{\text{Age} \leq 25\}$. That is, the five data points are split such that the left
and the right child of $v$ contain two and three data points, respectively. Next, the following
surrogate splitting rules can be considered:
1. $\mathbb{1}\{\text{Salary} \leq 1500\}$, and
2. $\mathbb{1}\{\text{Height} \leq 173\}$.


Table 8.1: Example data with three variables (Age, Height, and Salary). 


Id   Age   Height   Salary

1    20    173      1000
2    25    168      1500
3    38    191      1700
4    49    170      1900
5    62    182      2000


The $\mathbb{1}\{\text{Salary} \leq 1500\}$ surrogate rule completely mimics the primary rule, in the sense
that the data splits induced by these rules are identical. Namely, both rules partition the
data into two sets (by Id) $\{1, 2\}$ and $\{3, 4, 5\}$. On the other hand, the $\mathbb{1}\{\text{Height} \leq 173\}$ rule
is less similar to the primary rule, since it causes the different partition $\{1, 2, 4\}$ and $\{3, 5\}$.

It is up to the user to define the number of surrogate rules for each tree node. As soon as
these surrogate rules are available, we can use them to handle a new data point, even if the
main rule cannot be applied due to a missing value of the primary variable $x_{j^*}$. Specifically,
if the observation is missing the primary split variable, we apply the first (best) surrogate
rule. If the first surrogate variable is also missing, we apply the second best surrogate rule,
and so on.
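
The ranking of the two surrogate rules for Table 8.1 can be checked numerically. The sketch below assumes that similarity is measured as the fraction of observations that the primary and the surrogate rule send to the same child (one minus the binary misclassification loss); the helper name `agreement` is ours and is not part of the text.

```python
import numpy as np

# Data from Table 8.1 (rows correspond to Id 1..5)
age    = np.array([20, 25, 38, 49, 62])
height = np.array([173, 168, 191, 170, 182])
salary = np.array([1000, 1500, 1700, 1900, 2000])

def agreement(primary, surrogate):
    """Fraction of points that both rules send to the same child."""
    return np.mean(primary == surrogate)

primary = age <= 25                      # primary rule 1{Age <= 25}
surrogates = {
    "Salary <= 1500": salary <= 1500,
    "Height <= 173": height <= 173,
}

# Rank the surrogate rules from most to least similar to the primary rule
ranked = sorted(surrogates,
                key=lambda r: agreement(primary, surrogates[r]),
                reverse=True)
for rule in ranked:
    print(rule, agreement(primary, surrogates[rule]))
```

The Salary rule agrees with the primary rule on all five points, whereas the Height rule disagrees on the point with Id 4, so the Salary rule would be tried first when Age is missing.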


8.4 Controlling the Tree Shape 


Eventually, we are interested in getting the right-size tree. Namely, a tree that shows good 
generalization properties. It was already discussed in Section 8.2.3 (Figure 8.4) that shal- 
low trees tend to underfit and deep trees tend to overfit the data. Basically, a shallow tree 
does not produce a sufficient number of splits and a deep tree will produce many partitions 
and thus many leaf nodes. If we grow the tree to a sufficient depth, each training sample 
will occupy a separate leaf and we will observe a zero loss with respect to the training data. 
The above phenomenon is illustrated in Figure 8.6, which presents the cross-validation loss 
and the training loss as a function of the tree depth. 

In order to overcome the under- and the overfitting problem, Breiman et al. [20]
examined the possibility of stopping the tree from growing as soon as the decrease in loss
due to a split of node $v$, as expressed in the difference of (8.8) and (8.9), is smaller than
some predefined parameter $\delta \in \mathbb{R}$. Under this setting, the tree construction process will
terminate when no leaf node can be split such that the resulting decrease in the training
loss is greater than $\delta$.

Figure 8.6: The cross-validation and the training loss as a function of the tree depth for a
binary classification problem.


The authors found that this approach was unsatisfactory. Specifically, it was noted that a
very small $\delta$ leads to an excessive amount of splitting and thus causes overfitting. Increasing
$\delta$ did not work either. The problem is that the nature of the proposed rule is one-step-look-ahead.
To see this, consider a tree node for which the best possible decrease in loss is
smaller than $\delta$. According to the proposed procedure, this node will not be split further. This
may, however, be sub-optimal, because it could happen that one of the node’s descendants,
if split, could lead to a major decrease in loss.
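
A small numerical illustration of this one-step-look-ahead problem is sketched below. The XOR-type data set and the helper names `sse` and `best_gain` are our own assumptions, chosen so that no single split at the root reduces the squared-error loss, while a split one level further down removes the loss entirely.

```python
import numpy as np

# Hypothetical XOR-type data: the response is 1 when exactly one feature is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(float)

def sse(v):
    """Squared-error loss of a constant (mean) predictor."""
    return 0.0 if len(v) == 0 else np.sum((v - v.mean()) ** 2)

def best_gain(X, y):
    """Largest decrease in loss over all splits 1{x_j <= xi}."""
    gains = []
    for j in range(X.shape[1]):
        for xi in np.unique(X[:, j]):
            left = X[:, j] <= xi
            gains.append(sse(y) - sse(y[left]) - sse(y[~left]))
    return max(gains)

root_gain = best_gain(X, y)                 # no single split helps at the root
left = X[:, 0] <= 0
child_gain = best_gain(X[left], y[left])    # a split one level down does help
print(root_gain, child_gain)
```

With any positive threshold $\delta$ the root would never be split for these data, even though a depth-two tree fits them perfectly.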

To address these issues, a so-called pruning routine can be employed. The idea is as
follows. We first grow a very deep tree and then prune it (remove nodes) upwards until we
reach the root node. Consequently, the pruning process causes the number of tree nodes
to decrease. While the tree is being pruned, the generalization risk gradually decreases up
to the point where it starts increasing again, at which point the pruning is stopped. This
decreasing/increasing behavior is due to the bias-variance tradeoff (2.22).

We next describe the details. To start with, let v and v’ be tree nodes. We say that v’ is 
a descendant of v if there is a path down the tree, which leads from v to v’. If such a path 
exists, we also say that v is an ancestor of v’. Consider the tree in Figure 8.7. 














Figure 8.7: The node $v_9$ is a descendant of $v_2$, and $v_2$ is an ancestor of $\{v_4, v_5, v_7, v_8, v_9\}$, but
$v_6$ is not a descendant of $v_2$.
To formally define pruning, we will require the following Definition 8.1. An example
of pruning is demonstrated in Figure 8.8.


Definition 8.1: Branches and Pruning 


1. A tree branch $T_v$ of the tree $T$ is a sub-tree of $T$ rooted at node $v \in T$.

2. The pruning of branch $T_v$ from a tree $T$ is performed via deletion of the entire
branch $T_v$ from $T$ except the branch’s root node $v$. The resulting pruned tree is
denoted by $T - T_v$.

3. A sub-tree $T - T_v$ is called a pruned sub-tree of $T$. We indicate this with the
notation $T - T_v \prec T$ or $T \succ T - T_v$.











(a) $T$    (b) $T_{v_2}$    (c) $T - T_{v_2}$








Figure 8.8: The pruned tree $T - T_{v_2}$ in (c) is the result of pruning the $T_{v_2}$ branch in (b) from
the original tree $T$ in (a).


A basic decision tree pruning procedure is summarized in Algorithm 8.4.1. 


Algorithm 8.4.1: Decision Tree Pruning
Input: Training set $\tau$.
Output: Sequence of decision trees $T^0 \succ T^1 \succ \cdots$
1 Build a large decision tree $T^0$ via Algorithm 8.2.1. [A possible termination
  criterion for that algorithm is to have some small predetermined number of data
  points at each terminal node of $T^0$.]
2 $T' \leftarrow T^0$
3 $k \leftarrow 0$
4 while $T'$ has more than one node do
5     $k \leftarrow k + 1$
6     Choose $v \in T'$.
7     Prune the branch rooted at $v$ from $T'$.
8     $T^k \leftarrow T' - T_v$ and $T' \leftarrow T^k$.
9 return $T^0, T^1, \ldots, T^k$
Let $T^0$ be the initial (deep) tree and let $T^k$ be the tree obtained after the $k$-th pruning
operation, for $k = 1,\ldots,K$. As soon as the sequence of trees $T^0 \succ T^1 \succ \cdots \succ T^K$ is
available, one can choose the best tree of $\{T^k\}_{k=1}^K$ according to the smallest generalization risk.
Specifically, we can split the data into training and validation sets. In this case, Algorithm
8.4.1 is executed using the training set and the generalization risks of $\{T^k\}_{k=1}^K$ are estimated
via the validation set.

While Algorithm 8.4.1 and the corresponding best tree selection process look appealing,
there is still an important question to consider; namely, how to choose the node $v$ and
the corresponding branch $T_v$ in Line 6 of the algorithm. In order to overcome this problem,
Breiman proposed a method called cost-complexity pruning, which we discuss next.


8.4.1 Cost-Complexity Pruning 


Let $T \prec T^0$ be a tree obtained via pruning of a tree $T^0$. Denote the set of leaf (terminal)
nodes of $T$ by $\mathcal{W}$. The number of leaves $|\mathcal{W}|$ is a measure for the complexity of the tree;
recall that $|\mathcal{W}|$ is the number of regions $\{\mathcal{R}_w\}$ in the partition of $\mathcal{X}$. Corresponding to each
tree $T$ is a prediction function $g$, as in (8.1). In cost-complexity pruning the objective is to
find a prediction function $g$ (or, equivalently, tree $T$) that minimizes the training loss $\ell_\tau(g)$
while taking into account the complexity of the tree. The idea is to regularize the training
loss, similar to what was done in Chapter 6, by adding a penalty term for the complexity
of the tree. This leads to the following definition.


Definition 8.2: Cost-Complexity Measure 


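Let $\tau = \{(x_i, y_i)\}_{i=1}^n$ be a data set and $\gamma \geq 0$ be a real number. For a given tree $T$, the
cost-complexity measure $C_\tau(\gamma, T)$ is defined as:


$$\begin{aligned}
C_\tau(\gamma, T) &:= \frac{1}{n} \sum_{w \in \mathcal{W}} \left( \sum_{i=1}^n \mathbb{1}\{x_i \in \mathcal{R}_w\}\, \mathrm{Loss}(y_i, g^w(x_i)) \right) + \gamma\, |\mathcal{W}| \\
&= \ell_\tau(g) + \gamma\, |\mathcal{W}|,
\end{aligned} \tag{8.13}$$


where $\ell_\tau(g)$ is the training loss (8.2).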


Small values of $\gamma$ result in a small penalty for the tree complexity $|\mathcal{W}|$, and thus large
trees (that fit the entire training data well) will minimize the measure $C_\tau(\gamma, T)$. In particular,
for $\gamma = 0$, $T = T^0$ will be the minimizer of $C_\tau(\gamma, T)$. On the other hand, large values of $\gamma$
will prefer smaller trees or, more precisely, trees with fewer leaves. For sufficiently large
$\gamma$, the solution $T$ will collapse to a single (root) node.

It can be shown that, for every value of $\gamma$, there exists a smallest minimizing sub-tree
with respect to the cost-complexity measure. In practice, a suitable $\gamma$ is selected via
observing the performance of the learner on the validation set or by cross-validation.
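
The effect of the penalty can be illustrated with a few lines of code. The sketch below uses a hypothetical sequence of pruned trees, described only by their number of leaves and training loss; these numbers are made up for illustration and do not come from any data set in the text. For each $\gamma$ it selects the tree that minimizes the cost-complexity measure (8.13).

```python
import numpy as np

# Hypothetical pruned-tree sequence: number of leaves and training loss.
# The values are illustrative only.
leaves = np.array([16, 8, 4, 2, 1])
train_loss = np.array([0.10, 0.14, 0.25, 0.60, 1.50])

def cost_complexity(gamma):
    """C(gamma, T) = training loss + gamma * number of leaves, for each tree."""
    return train_loss + gamma * leaves

for gamma in [0.0, 0.01, 0.1, 1.0]:
    k = np.argmin(cost_complexity(gamma))
    print(f"gamma = {gamma}: tree with {leaves[k]} leaves is selected")
```

As $\gamma$ grows, the selected tree shrinks from the full tree (16 leaves) down to the root node alone, in line with the discussion above.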

The advantages and the corresponding limitations of decision trees are detailed next.





8.4.2 Advantages and Limitations of Decision Trees


We list a number of advantages and disadvantages of decision trees, as compared with
other supervised learning methods such as those discussed in Chapters 5, 6, and 7.


Advantages 


1. The tree structure can handle both categorical and numerical features in a natural
   and straightforward way. Specifically, there is no need to pre-process categorical
   features, say via the introduction of dummy variables.

2. The final tree obtained after the training phase can be compactly stored for the
   purpose of making predictions for new feature vectors. The prediction process only
   involves a single tree traversal from the tree root to a leaf.

3. The hierarchical nature of decision trees allows for an efficient encoding of the
   feature’s conditional information. Specifically, after an internal split of a feature $x_j$ via
   the standard splitting rule (8.10), Algorithm 8.2.1 will only consider such subsets of
   data that were constructed based on this split, thus implicitly exploiting the
   corresponding conditional information from the initial split of $x_j$.

4. The tree structure can be easily understood and interpreted by domain experts with
   little statistical knowledge, since it is essentially a logical decision flow diagram.

5. The sequential decision tree growth procedure in Algorithm 8.2.1, and in particular
   the fact that the tree has been split using the most important features, provides an
   implicit step-wise variable elimination procedure. In addition, the partition of the
   variable space into smaller regions results in simpler prediction problems in these
   regions.

6. Decision trees are invariant under monotone transformations of the data. To see this,
   consider the (optimal) splitting rule $s(x) = \mathbb{1}\{x_3 \leq 2\}$, where $x_3$ is a positive feature.
   Suppose that $x_3$ is transformed to $x_3' = x_3^2$. Now, the optimal splitting rule will take
   the form $s(x) = \mathbb{1}\{x_3' \leq 4\}$.

7. In the classification setting, it is common to report not only the predicted value of a
   feature vector, e.g., as in (8.6), but also the respective class probabilities. Decision
   trees handle this task without any additional effort. Specifically, consider a new
   feature vector. During the estimation process, we will perform a tree traversal and the
   point will end up in a certain leaf $w$. The probability of this feature vector lying in
   class $z$ can be estimated as the proportion of training points in $w$ that are in class $z$.

8. As each training point is treated equally in the construction of a tree, the structure
   of the tree will be relatively robust to outliers. In a way, trees exhibit a similar kind
   of robustness as the sample median does for real-valued data.
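
The invariance under monotone transformations can be checked numerically. The sketch below is only an illustration on simulated data; the data-generating choices and the helper name `best_threshold` are our own. It finds the best squared-error split of a positive feature and of its square, and confirms that both splits induce the same partition of the data, with the threshold transformed accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=200)        # a positive feature
y = (x > 2).astype(float) + rng.normal(0, 0.1, size=200)

def best_threshold(x, y):
    """Threshold xi minimizing the squared-error loss of the split 1{x <= xi}."""
    def sse(v):
        return 0.0 if len(v) == 0 else np.sum((v - v.mean()) ** 2)
    losses = [sse(y[x <= xi]) + sse(y[x > xi]) for xi in x]
    return x[int(np.argmin(losses))]

xi = best_threshold(x, y)          # split on the original feature
xi_sq = best_threshold(x**2, y)    # split on the transformed feature x^2

# Same partition of the data; the threshold is transformed along with the feature
same_partition = np.array_equal(x <= xi, x**2 <= xi_sq)
print(same_partition, np.isclose(xi**2, xi_sq))
```

Only the ordering of the feature values matters for the split search, which is why any strictly increasing transformation leaves the fitted tree's partition unchanged.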


Limitations


Despite the fact that decision trees are extremely interpretable, their predictive accuracy
is generally inferior to other established statistical learning methods. In addition, decision
trees, and in particular very deep trees that were not subject to pruning, are heavily reliant
on their training set. A small change in the training set can result in a dramatic change of the
resulting decision tree. Their inferior predictive accuracy, however, is a direct consequence
of the bias-variance tradeoff. Specifically, a decision tree model generally exhibits a high
variance. To overcome the above limitations, several promising approaches such as
bagging, random forest, and boosting are introduced below.


The bagging approach was initially introduced in the context of an ensemble of
decision trees. However, both the bagging and the boosting methods can be applied
to improve the accuracy of general prediction functions.





8.5 Bootstrap Aggregation 


The major idea of the bootstrap aggregation or bagging method is to combine prediction 
functions learned from multiple data sets, with a view to improving overall prediction 
accuracy. Bagging is especially beneficial when dealing with predictors that tend to overfit 
the data, such as in decision trees, where the (unpruned) tree structure is very sensitive to 
small changes in the training set [37, 55]. 

To start with, consider an idealized setting for a regression tree, where we have access
to $B$ iid copies³ $\mathcal{T}_1, \ldots, \mathcal{T}_B$ of a training set $\mathcal{T}$. Then, we can train $B$ separate regression
models ($B$ different decision trees) using these sets, giving learners $g_{\mathcal{T}_1}, \ldots, g_{\mathcal{T}_B}$, and take
their average:


$$g_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^B g_{\mathcal{T}_b}(x). \tag{8.14}$$


By the law of large numbers, as $B \to \infty$, the average prediction function converges to
the expected prediction function $g^\dagger := \mathbb{E} g_{\mathcal{T}}$. The following result shows that using $g^\dagger$ as
a prediction function (if it were known) would result in an expected squared-error
generalization risk that is less than or equal to the expected generalization risk for a general
prediction function $g_{\mathcal{T}}$. It thus suggests that taking an average of prediction functions may
lead to a better expected squared-error generalization risk.


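Theorem 8.1: Expected Squared-Error Generalization Risk

Let $\mathcal{T}$ be a random training set and let $X, Y$ be a random feature vector and response
that are independent of $\mathcal{T}$. Then,

$$\mathbb{E}\,(Y - g_{\mathcal{T}}(X))^2 \geq \mathbb{E}\,(Y - g^\dagger(X))^2.$$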








³ In this section $\mathcal{T}_k$ means the $k$-th training set, not a training set of size $k$.
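Proof: We have

$$\mathbb{E}\left[ (Y - g_{\mathcal{T}}(X))^2 \mid X, Y \right] \geq \left( \mathbb{E}[Y \mid X, Y] - \mathbb{E}[g_{\mathcal{T}}(X) \mid X, Y] \right)^2 = (Y - g^\dagger(X))^2,$$

where the inequality follows from $\mathbb{E} U^2 \geq (\mathbb{E} U)^2$ for any (conditional) expectation.
Consequently, by the tower property,

$$\mathbb{E}\,(Y - g_{\mathcal{T}}(X))^2 = \mathbb{E}\left[ \mathbb{E}\left[ (Y - g_{\mathcal{T}}(X))^2 \mid X, Y \right] \right] \geq \mathbb{E}\,(Y - g^\dagger(X))^2.$$
□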


Unfortunately, multiple independent data sets are rarely available. But we can substitute
them by bootstrapped ones. Specifically, instead of the $\mathcal{T}_1, \ldots, \mathcal{T}_B$ sets, we can obtain
random training sets $\mathcal{T}_1^*, \ldots, \mathcal{T}_B^*$ by resampling them from a single (fixed) training set $\tau$,
similar to Algorithm 3.2.6, and use them to train $B$ separate models. By model averaging
as in (8.14) we obtain the bootstrapped aggregated estimator or bagged estimator of the
form:


$$g_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^B g_{\mathcal{T}_b^*}(x). \tag{8.15}$$


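Algorithm 8.5.1: Bootstrap Aggregation Sampling
Input: Training set $\tau = \{(x_i, y_i)\}_{i=1}^n$ and resample size $B$.
Output: Bootstrapped data sets.
1 for $b = 1$ to $B$ do
2     $\mathcal{T}_b^* \leftarrow \emptyset$
3     for $i = 1$ to $n$ do
4         Draw $U \sim \mathsf{U}(0, 1)$
5         $I \leftarrow \lceil nU \rceil$   // select random index
6         $\mathcal{T}_b^* \leftarrow \mathcal{T}_b^* \cup \{(x_I, y_I)\}$.
7 return $\mathcal{T}_b^*$, $b = 1, \ldots, B$.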
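
A direct transcription of Algorithm 8.5.1 into Python might look as follows. This is a minimal sketch: the function name `bootstrap_sets`, the toy data, and the seed are our own choices, and the inner loop over $i$ is vectorized by drawing all $n$ uniforms at once.

```python
import numpy as np

def bootstrap_sets(X, y, B, rng):
    """Return B bootstrapped copies of (X, y), each of size n,
    drawn with replacement as in Algorithm 8.5.1."""
    n = len(y)
    sets = []
    for _ in range(B):
        U = rng.uniform(size=n)                  # n draws of U ~ U(0,1)
        I = np.ceil(n * U).astype(int) - 1       # ceil(nU) in 1..n, shifted to 0-based
        I = np.clip(I, 0, n - 1)                 # guard against the rare draw U = 0
        sets.append((X[I], y[I]))
    return sets

rng = np.random.default_rng(1)                   # arbitrary seed
X = np.arange(10).reshape(5, 2)                  # toy features (hypothetical)
y = np.arange(5)                                 # toy responses (hypothetical)
boot = bootstrap_sets(X, y, B=3, rng=rng)
for Xb, yb in boot:
    print(yb)
```

In practice the same effect is obtained with `rng.integers(0, n, size=n)`, which draws the indices directly.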


■ Remark 8.1 (Bootstrap Aggregation for Classification Problems) Note that (8.15)
is suitable for handling regression problems. However, the bagging idea can be readily
extended to handle classification settings as well. For example, $g_{\mathrm{bag}}$ can take the majority
vote among $\{g_{\mathcal{T}_b^*}\}$, $b = 1, \ldots, B$; that is, to accept the most frequent class among $B$
predictors. ■


While bagging can be applied for any statistical model (such as decision trees, neural
networks, linear regression, K-nearest neighbors, and so on), it is most effective for
predictors that are sensitive to small changes in the training set. The reason becomes clear
when we decompose the expected generalization risk as


$$\mathbb{E}\,\ell(g_{\mathcal{T}}) = \ell^* + \underbrace{\mathbb{E}\left( \mathbb{E}[g_{\mathcal{T}}(X) \mid X] - g^*(X) \right)^2}_{\text{expected squared bias}} + \underbrace{\mathbb{E}\left[ \mathrm{Var}[g_{\mathcal{T}}(X) \mid X] \right]}_{\text{expected variance}}, \tag{8.16}$$
similar to (2.22). Compare this with the same decomposition for the average prediction
function $g_{\mathrm{avg}}$ in (8.14). As $\mathbb{E} g_{\mathrm{avg}}(x) = \mathbb{E} g_{\mathcal{T}}(x)$, we see that any possible improvement in
the generalization risk must be due to the expected variance term. Averaging and bagging
are thus only useful for predictors with a large expected variance, relative to the other two
terms. Examples of such “unstable” predictors include decision trees, neural networks, and
subset selection in linear regression [22]. On the other hand, “stable” predictors are
insensitive to small data changes, an example being the K-nearest neighbors method. Note
that for independent training sets $\mathcal{T}_1, \ldots, \mathcal{T}_B$ a reduction of the variance by a factor $B$ is
achieved: $\mathrm{Var}\, g_{\mathrm{avg}}(x) = B^{-1}\, \mathrm{Var}\, g_{\mathcal{T}}(x)$. Again, it depends on the squared bias and
irreducible loss how significant this reduction is for the generalization risk.


■ Remark 8.2 (Limitations of Bagging) It is important to remember that $g_{\mathrm{bag}}$ is not
exactly equal to $g_{\mathrm{avg}}$, which in turn is not exactly $g^\dagger$. Specifically, $g_{\mathrm{bag}}$ is constructed from the
bootstrap approximation of the sampling pdf $f$. As a consequence, for stable predictors,
it can happen that $g_{\mathrm{bag}}$ will perform worse than $g_{\mathcal{T}}$. In addition to the deterioration of the
bagging performance for stable procedures, it can also happen that $g_{\mathcal{T}}$ has already achieved
a near optimal predictive accuracy given the available training data. In this case, bagging
will not introduce a significant improvement. ■


The bagging process provides an opportunity to estimate the generalization risk of
the bagged model without an additional test set. Specifically, recall that we obtain the
$\mathcal{T}_1^*, \ldots, \mathcal{T}_B^*$ sets from a single training set $\tau$ by sampling via Algorithm 8.5.1, and use them
to train $B$ separate models. It can be shown (see Exercise 8) that, for large sample sizes, on
average about a third (more precisely, a fraction $e^{-1} \approx 0.37$) of the original sample points
are not included in bootstrapped set $\mathcal{T}_b^*$ for $1 \leq b \leq B$. Therefore, these samples can be
used for the loss estimation. These samples are called out-of-bag (OOB) observations.
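
The out-of-bag fraction is easy to check by simulation. The following sketch uses an arbitrary sample size and number of bootstrap sets; it counts, for each bootstrap set, how many original indices were never drawn, and compares the average fraction with $(1 - 1/n)^n$ and its limit $e^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 200          # sample size and number of bootstrap sets (arbitrary)

oob_fractions = []
for _ in range(B):
    ids = rng.integers(0, n, size=n)           # bootstrap indices, with replacement
    oob = np.setdiff1d(np.arange(n), ids)      # indices that were never drawn
    oob_fractions.append(len(oob) / n)

# average OOB fraction, exact expectation (1 - 1/n)^n, and the limit e^{-1}
print(np.mean(oob_fractions), (1 - 1/n)**n, np.exp(-1))
```

Each point is missed by a single draw with probability $1 - 1/n$, so it is missed by all $n$ draws with probability $(1 - 1/n)^n$, which tends to $e^{-1}$.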

Specifically, for each sample from the original data set, we calculate the OOB loss using
predictors that were trained without this particular sample. The estimation procedure is
summarized in Algorithm 8.5.2. Tibshirani et al. [55] observe that the OOB loss is almost
identical to the n-fold cross-validation loss. In addition, the OOB loss can be used to
determine the number of trees required: decision trees are added until the OOB loss
stabilizes.


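Algorithm 8.5.2: Out-of-Bag Loss Estimation
Input: The original data set $\tau = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the bootstrapped data sets
       $\{\mathcal{T}_1^*, \ldots, \mathcal{T}_B^*\}$, and the trained predictors $\{g_{\mathcal{T}_1^*}, \ldots, g_{\mathcal{T}_B^*}\}$.
Output: Out-of-bag loss for the averaged model.
1 for $i = 1$ to $n$ do
2     $\mathcal{C}_i \leftarrow \emptyset$   // Indices of predictors not depending on $(x_i, y_i)$
3     for $b = 1$ to $B$ do
4         if $(x_i, y_i) \notin \mathcal{T}_b^*$ then $\mathcal{C}_i \leftarrow \mathcal{C}_i \cup \{b\}$
5     $Y_i' \leftarrow |\mathcal{C}_i|^{-1} \sum_{b \in \mathcal{C}_i} g_{\mathcal{T}_b^*}(x_i)$
6     $L_i \leftarrow \mathrm{Loss}(y_i, Y_i')$
7 $L_{\mathrm{OOB}} \leftarrow \frac{1}{n} \sum_{i=1}^n L_i$
8 return $L_{\mathrm{OOB}}$.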
■ Example 8.3 (Bagging for a Regression Tree) We next proceed with a basic bagging
example for a regression tree, in which we compare the decision tree estimator with the
corresponding bagged estimator. We use the $R^2$ metric (coefficient of determination) for
comparison.


BaggingExample.py 





import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

np.random.seed(100)

# create regression problem
n_points = 1000  # points
x, y = make_friedman1(n_samples=n_points, n_features=15,
                      noise=1.0, random_state=100)

# split to train/test set
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.33, random_state=100)

# training
regTree = DecisionTreeRegressor(random_state=100)
regTree.fit(x_train, y_train)

# test
yhat = regTree.predict(x_test)

# Bagging construction
n_estimators = 500
bag = np.empty((n_estimators), dtype=object)
bootstrap_ds_arr = np.empty((n_estimators), dtype=object)
for i in range(n_estimators):
    # sample bootstrapped data set
    # note: indices range over the first len(x_test) training points only
    ids = np.random.choice(range(0,len(x_test)), size=len(x_test),
                           replace=True)
    x_boot = x_train[ids]
    y_boot = y_train[ids]
    bootstrap_ds_arr[i] = np.unique(ids)

    bag[i] = DecisionTreeRegressor()
    bag[i].fit(x_boot, y_boot)

# bagging prediction
yhatbag = np.zeros(len(y_test))
for i in range(n_estimators):
    yhatbag = yhatbag + bag[i].predict(x_test)
yhatbag = yhatbag/n_estimators


# out of bag loss estimation
oob_pred_arr = np.zeros(len(x_train))
for i in range(len(x_train)):
    x = x_train[i].reshape(1, -1)
    C = []   # indices of trees trained without sample i
    for b in range(n_estimators):
        if not np.isin(i, bootstrap_ds_arr[b]):
            C.append(b)
    for pred in bag[C]:
        oob_pred_arr[i] = oob_pred_arr[i] + (pred.predict(x)[0]/len(C))

L_oob = r2_score(y_train, oob_pred_arr)

print("DecisionTreeRegressor R^2 score = ", r2_score(y_test, yhat),
      "\nBagging R^2 score = ", r2_score(y_test, yhatbag),
      "\nBagging OOB R^2 score = ", L_oob)


DecisionTreeRegressor R^2 score = 0.575438224929718
Bagging R^2 score = 0.7612121189201985
Bagging OOB R^2 score = 0.7758253149069059





The decision tree bagging improves the test-set $R^2$ score by about 32% (from 0.575
to 0.761). Moreover, the OOB score (0.776) is very close to the test-set score
(0.761) of the bagged estimator. ■


The bagging procedure can be further enhanced by introducing random forests, which
are discussed next.


8.6 Random Forests 


In Section 8.5, we discussed the intuition behind the prediction averaging procedure.
Specifically, for some feature vector $x$ let $Z_b = g_{\mathcal{T}_b}(x)$, $b = 1, 2, \ldots, B$ be iid prediction
values, obtained from independent training sets $\mathcal{T}_1, \ldots, \mathcal{T}_B$. Suppose that $\mathrm{Var}\, Z_b = \sigma^2$ for all
$b = 1, \ldots, B$. Then the variance of the average prediction value $\bar{Z}_B$ is equal to $\sigma^2/B$.
However, if bootstrapped data sets $\{\mathcal{T}_b^*\}$ are used instead, the corresponding random variables
$\{Z_b\}$ will be correlated. In particular, $Z_b = g_{\mathcal{T}_b^*}(x)$ for $b = 1, \ldots, B$ are identically
distributed (but not independent) with some positive pairwise correlation $\varrho$. It then holds that (see
Exercise 9)


$$\mathrm{Var}\, \bar{Z}_B = \varrho\, \sigma^2 + \sigma^2\, \frac{(1 - \varrho)}{B}. \tag{8.17}$$


While the second term of (8.17) goes to zero as the number of averaged predictors $B$
increases, the first term remains constant.
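
The variance formula for correlated predictions can be verified by a short Monte Carlo experiment. In the sketch below the values of $B$, $\sigma^2$, and $\varrho$ are arbitrary, and equicorrelated normal variables are generated (by our own choice of construction) as a shared component plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho = 50, 4.0, 0.3        # arbitrary illustrative values
N = 50_000                           # number of Monte Carlo replications

# Equicorrelated Z_1,...,Z_B: shared component plus independent noise,
# so that Var Z_b = sigma2 and Corr(Z_b, Z_c) = rho for b != c
shared = rng.normal(size=(N, 1)) * np.sqrt(rho * sigma2)
indiv = rng.normal(size=(N, B)) * np.sqrt((1 - rho) * sigma2)
Z = shared + indiv

empirical = Z.mean(axis=1).var()                      # variance of the average
theoretical = rho * sigma2 + sigma2 * (1 - rho) / B   # right-hand side of (8.17)
print(empirical, theoretical)
```

Increasing `B` further barely changes either number, because the term $\varrho\,\sigma^2$ does not depend on $B$.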

This issue is particularly relevant for bagging with decision trees. For example,
consider a situation in which there exists a feature that provides a very good split of the data.
Such a feature will be selected and split for every one of the trees $\{g_{\mathcal{T}_b^*}\}_{b=1}^B$ at the root level and we will
consequently end up with highly correlated predictions. In such a situation, prediction
averaging will not introduce the desired improvement in the performance of the bagged
predictor.
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The major idea of random forests is to perform bagging in combination with a “decor- 
relation” of the trees by including only a subset of features during the tree construction. For 
each bootstrapped training set 7," we build a decision tree using a randomly selected subset 
of m < p features for the splitting rules. This simple but powerful idea will decorrelate the 
trees, since strong predictors will have a smaller chance to be considered at the root levels. 

Consequently, we can expect to improve the predictive performance of the bagged
estimator. The resulting predictor (random forest) construction is summarized in Algorithm
8.6.1.


Algorithm 8.6.1: Random Forest Construction
Input: Training set $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$, the number of trees in the forest $B$, and the
       number $m \leq p$ of features to be included, where $p$ is the total number of
       features in $\boldsymbol{x}$.
Output: Ensemble of trees.
1 Generate bootstrapped training sets $\{\mathcal{T}_1^*, \ldots, \mathcal{T}_B^*\}$ via Algorithm 8.5.1.
2 for b = 1 to B do
3     Randomly select m out of p features, without replacement.
4     Using only these features, train a decision tree $g_{\mathcal{T}_b^*}$ via Algorithm 8.2.1.
5 return $\{g_{\mathcal{T}_b^*}\}_{b=1}^B$.


For regression problems, the output of Algorithm 8.6.1 is combined to yield the random
forest prediction function:

$$g_{\mathrm{RF}}(\boldsymbol{x}) = \frac{1}{B} \sum_{b=1}^{B} g_{\mathcal{T}_b^*}(\boldsymbol{x}).$$

In the classification setting, similar to Remark 8.1, we take instead the majority vote from
the $\{g_{\mathcal{T}_b^*}\}$.
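
The construction in Algorithm 8.6.1 and the averaging step for regression can be sketched as
follows. This is a minimal illustration under stated assumptions, not a reference
implementation: depth-1 regression trees stand in for the trees of Algorithm 8.2.1, the
helper names (`fit_stump`, `predict_stump`, `random_forest_fit`, `random_forest_predict`) are
hypothetical, and the synthetic data are chosen only for illustration.

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree: best single split under squared-error loss."""
    best = (np.inf, 0, -np.inf, y.mean(), y.mean())   # fallback: constant prediction
    for j in range(x.shape[1]):
        for xi in np.unique(x[:, j])[:-1]:            # candidate thresholds
            left = x[:, j] <= xi
            c_left, c_right = y[left].mean(), y[~left].mean()
            sse = ((y[left] - c_left) ** 2).sum() + ((y[~left] - c_right) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, xi, c_left, c_right)
    return best[1:]          # (feature, threshold, left value, right value)

def predict_stump(stump, x):
    j, xi, c_left, c_right = stump
    return np.where(x[:, j] <= xi, c_left, c_right)

def random_forest_fit(x, y, B, m, rng):
    n, p = x.shape
    forest = []
    for b in range(B):
        rows = rng.integers(0, n, size=n)             # bootstrapped training set
        cols = rng.choice(p, size=m, replace=False)   # m of p features, no replacement
        forest.append((cols, fit_stump(x[rows][:, cols], y[rows])))
    return forest

def random_forest_predict(forest, x):
    # regression: average the B tree predictions
    return np.mean([predict_stump(s, x[:, cols]) for cols, s in forest], axis=0)

rng = np.random.default_rng(0)
x = rng.uniform(size=(100, 5))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)    # only feature 0 is informative
forest = random_forest_fit(x, y, B=25, m=3, rng=rng)
yhat = random_forest_predict(forest, x)
print("training MSE:", np.mean((y - yhat) ** 2), " variance of y:", np.var(y))
```

For classification, the averaging in `random_forest_predict` would be replaced by a majority
vote over the tree predictions.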


Example 8.4 (Random Forest for a Regression Tree) We continue with the basic
bagging Example 8.3 for a regression tree, in which we compared the decision tree estimator
with the corresponding bagged estimator. Here, however, we use the random forest
with B = 500 trees and a subset size m = 8. It can be seen that the random forest’s $R^2$ score
exceeds that of the bagged estimator.


BaggingExampleRF.py

from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor

# create regression problem
n_points = 1000 # points
x, y = make_friedman1(n_samples=n_points, n_features=15,
                      noise=1.0, random_state=100)

# split to train/test set
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.33, random_state=100)

# random forest with B = 500 trees and subset size m = 8
rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                           max_features=8, random_state=100)
rf.fit(x_train, y_train)
yhatrf = rf.predict(x_test)
print("RF R^2 score = ", r2_score(y_test, yhatrf),
      "\nRF OOB R^2 score = ", rf.oob_score_)


RF R^2 score = 0.8106589580845707
RF OOB R^2 score = 0.8260541058404149





Remark 8.3 (The Optimal Number of Subset Features m) The default values for m
are $\lfloor p/3 \rfloor$ and $\lfloor \sqrt{p} \rfloor$ for the regression and classification settings, respectively. However, the
standard practice is to treat m as a hyperparameter that requires tuning, depending on the
specific problem at hand [55].


Note that the procedure of bagging decision trees is a special case of a random forest 
construction (see Exercise 11). Consequently, the OOB loss is readily available for random 
forests. 

While the advantage of bagging in the sense of enhanced accuracy is clear, we should
also consider its negative aspects and, in particular, the loss of interpretability. Specifically,
a random forest consists of many trees, thus making the prediction process hard to both
visualize and interpret. For example, given a random forest, it is not easy to determine a
subset of features that are essential for accurate prediction.

The feature importance measure intends to address this issue. The idea is as follows.
Each internal node of a decision tree induces a certain decrease in the training loss; see
(8.9). Let us denote this decrease in the training loss by $\Delta_{\mathrm{Loss}}(v)$, where $v$ is not a leaf node
of $\mathbb{T}$. In addition, recall that for splitting rules of the type $\mathbb{1}\{x_j \leq \xi\}$ ($1 \leq j \leq p$), each node
$v$ is associated with a feature $x_j$ that determines the split. Using the above definitions, we
can define the feature importance of $x_j$ as

$$\mathcal{I}_{\mathbb{T}}(x_j) = \sum_{v \text{ internal} \in \mathbb{T}} \Delta_{\mathrm{Loss}}(v)\, \mathbb{1}\{x_j \text{ is associated with } v\}, \quad 1 \leq j \leq p. \tag{8.18}$$


While (8.18) is defined for a single tree, it can be readily extended to random forests.
Specifically, the feature importance in that case will be averaged over all trees of the forest;
that is, for a forest consisting of B trees $\{\mathbb{T}_1, \ldots, \mathbb{T}_B\}$, the feature importance measure is:

$$\mathcal{I}_{\mathrm{RF}}(x_j) = \frac{1}{B} \sum_{b=1}^{B} \mathcal{I}_{\mathbb{T}_b}(x_j), \quad 1 \leq j \leq p. \tag{8.19}$$
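
The two importance measures can be sketched in a few lines. The example below is a minimal
illustration, assuming each tree is summarized by a list of (feature index, loss decrease)
pairs, one pair per internal node. The forest, its numbers, and the function names
`tree_importance` and `forest_importance` are hypothetical and serve only to show the sums
in (8.18) and (8.19). Feature indices are 0-based in the code.

```python
import numpy as np

def tree_importance(internal_nodes, p):
    """Sum the loss decreases of the internal nodes per splitting feature, as in (8.18)."""
    imp = np.zeros(p)
    for feature, loss_decrease in internal_nodes:
        imp[feature] += loss_decrease
    return imp

def forest_importance(forest, p):
    """Average the single-tree importances over the trees of the forest, as in (8.19)."""
    return np.mean([tree_importance(tree, p) for tree in forest], axis=0)

# hypothetical forest with p = 3 features and B = 2 trees
forest = [
    [(0, 0.50), (1, 0.20), (0, 0.10)],   # tree 1: three internal nodes
    [(0, 0.40), (2, 0.05)],              # tree 2: two internal nodes
]
print(forest_importance(forest, p=3))
```

In this toy forest the first feature collects importance 0.6 in the first tree and 0.4 in the
second, so its forest importance is 0.5.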


Example 8.5 (Feature Importance) We consider a classification problem with 15 features.
The data is specifically designed to contain only 5 informative features out of 15.
In the code below, we apply the random forest procedure and calculate the corresponding
feature importance measures, which are summarized in Figure 8.9.


VarImportance.py

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

n_points = 1000 # create classification data with 1000 data points
x, y = make_classification(n_samples=n_points, n_features=15,
    n_informative=5, n_redundant=0, n_repeated=0, random_state=100,
    shuffle=False)

rf = RandomForestClassifier(n_estimators=200, max_features="log2")
rf.fit(x, y)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]  # features sorted by decreasing importance

for f in range(15):
    print("Feature %d (%f)" % (indices[f]+1, importances[indices[f]]))

# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
             axis=0)
f = plt.figure()
plt.bar(range(x.shape[1]), importances[indices],
        color="b", yerr=std[indices], align="center")
plt.xticks(range(x.shape[1]), indices+1)
plt.xlim([-1, x.shape[1]])
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()

Figure 8.9: Importance measure for the 15-feature data set with only 5 informative features
$x_1, x_2, x_3, x_4$, and $x_5$.

Clearly, it is hard to visualize and understand the prediction process based on 200 trees.
However, Figure 8.9 shows that the features $x_1, x_2, x_3, x_4$, and $x_5$ were correctly identified
as being important.


8.7 Boosting 


Boosting is a powerful idea that aims to improve the accuracy of any learning algorithm, 
especially when involving weak learners — simple prediction functions that exhibit per- 
formance slightly better than random guessing. Shallow decision trees typically yield weak 
learners. 

Originally, boosting was developed for binary classification tasks, but it can be readily 
extended to handle general classification and regression problems. The boosting approach 
has some similarity with the bagging method in the sense that boosting uses an ensemble of 
prediction functions. Despite this similarity, there exists a fundamental difference between 
these methods. Specifically, while bagging involves the fitting of prediction functions to
bootstrapped data, the prediction functions in boosting are learned sequentially. That is,
each learner uses information from previous learners.

The idea is to start with a simple model (weak learner) $g_0$ for the data $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$
and then to improve or “boost” this learner to a learner $g_1 := g_0 + h_1$. Here, the function $h_1$
is found by minimizing the training loss for $g_0 + h$ over all functions $h$ in some class of
functions $\mathcal{H}$. For example, $\mathcal{H}$ could be the set of prediction functions that can be obtained
via a decision tree of maximal depth 2. Given a loss function Loss, the function $h_1$ is thus
obtained as the solution to the optimization problem

$$h_1 = \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\,(y_i, g_0(\boldsymbol{x}_i) + h(\boldsymbol{x}_i)). \tag{8.20}$$


This process can be repeated for $g_1$ to obtain $g_2 = g_1 + h_2$, and so on, yielding the boosted
prediction function

$$g_B(\boldsymbol{x}) = g_0(\boldsymbol{x}) + \sum_{b=1}^{B} h_b(\boldsymbol{x}). \tag{8.21}$$


Instead of using the updating step $g_b = g_{b-1} + h_b$, one prefers to use the smooth updating
step $g_b = g_{b-1} + \gamma\, h_b$, for some suitably chosen step-size parameter $\gamma$. As we shall see
shortly, this helps reduce overfitting.

Boosting can be used for regression and classification problems. We start with a simple
regression setting, using the squared-error loss; thus, $\mathrm{Loss}(y, \hat{y}) = (y - \hat{y})^2$. In this case, it
is common to start with $g_0(\boldsymbol{x}) = n^{-1} \sum_{i=1}^{n} y_i$, and each $h_b$ for $b = 1, \ldots, B$ is chosen as a
learner for the data set $\tau_b$ of residuals corresponding to $g_{b-1}$. That is, $\tau_b := \{(\boldsymbol{x}_i, e_i^{(b)})\}_{i=1}^n$,
with

$$e_i^{(b)} := y_i - g_{b-1}(\boldsymbol{x}_i). \tag{8.22}$$


This leads to the following boosting procedure for regression with squared-error loss. 


Algorithm 8.7.1: Regression Boosting with Squared-Error Loss
Input: Training set $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$, the number of boosting rounds $B$, and a
       shrinkage step-size parameter $\gamma$.
Output: Boosted prediction function.
1 Set $g_0(\boldsymbol{x}) \leftarrow n^{-1} \sum_{i=1}^{n} y_i$.
2 for b = 1 to B do
3     Set $e_i^{(b)} \leftarrow y_i - g_{b-1}(\boldsymbol{x}_i)$ for $i = 1, \ldots, n$, and let $\tau_b \leftarrow \{(\boldsymbol{x}_i, e_i^{(b)})\}_{i=1}^n$.
4     Fit a prediction function $h_b$ on the training data $\tau_b$.
5     Set $g_b(\boldsymbol{x}) \leftarrow g_{b-1}(\boldsymbol{x}) + \gamma\, h_b(\boldsymbol{x})$.
6 return $g_B$.
The step-size parameter $\gamma$ introduced in Algorithm 8.7.1 controls the speed of the
fitting process. Specifically, for small values of $\gamma$, boosting takes smaller steps towards
the training loss minimization. The step-size $\gamma$ is of great practical importance,
since it helps the boosting algorithm to avoid overfitting. This phenomenon is
demonstrated in Figure 8.10.

Figure 8.10: The left and the right panels show the fitted boosting regression model $g_{1000}$
with $\gamma = 1.0$ and $\gamma = 0.005$, respectively. Note the overfitting on the left.


A very basic implementation of Algorithm 8.7.1 which reproduces Figure 8.10 is 
provided below. 


RegressionBoosting.py

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

def TrainBoost(alpha, BoostingRounds, x, y):
    g_0 = np.mean(y)
    # note: the initial constant model is also scaled by alpha here
    residuals = y - alpha*g_0

    # list of basic regressors
    g_boost = []

    for i in range(BoostingRounds):
        h_i = DecisionTreeRegressor(max_depth=1)
        h_i.fit(x, residuals)
        residuals = residuals - alpha*h_i.predict(x)
        g_boost.append(h_i)

    return g_0, g_boost

def Predict(g_0, g_boost, alpha, x):
    yhat = alpha*g_0*np.ones(len(x))
    for j in range(len(g_boost)):
        yhat = yhat + alpha*g_boost[j].predict(x)

    return yhat

np.random.seed(1)
sz = 30

# create data set
x, y = make_regression(n_samples=sz, n_features=1, n_informative=1,
                       noise=10.0)

# boosting algorithm
BoostingRounds = 1000
alphas = [1, 0.005]

for alpha in alphas:
    g_0, g_boost = TrainBoost(alpha, BoostingRounds, x, y)
    yhat = Predict(g_0, g_boost, alpha, x)

    # plot the fitted model for this step size
    tmpX = np.reshape(np.linspace(-2.5, 2, 1000), (1000, 1))
    yhatX = Predict(g_0, g_boost, alpha, tmpX)
    f = plt.figure()
    plt.plot(x, y, '*')
    plt.plot(tmpX, yhatX)
    plt.show()





The parameter $\gamma$ can be viewed as a step size made in the direction of the negative
gradient of the squared-error training loss. To see this, note that the negative gradient

$$-\frac{\partial\, \mathrm{Loss}\,(y_i, z)}{\partial z}\bigg|_{z = g_{b-1}(\boldsymbol{x}_i)} = -\frac{\partial\, (y_i - z)^2}{\partial z}\bigg|_{z = g_{b-1}(\boldsymbol{x}_i)} = 2\,(y_i - g_{b-1}(\boldsymbol{x}_i))$$

is two times the residual $e_i^{(b)}$ given in (8.22) that is used in Algorithm 8.7.1 to fit the
prediction function $h_b$.

In fact, one of the major advances in the theory of boosting was the recognition that
one can use a similar gradient descent method for any differentiable loss function. The
resulting algorithm is called gradient boosting. The general gradient boosting algorithm is
summarized in Algorithm 8.7.2. The main idea is to mimic a gradient descent algorithm
in the following sense. At each stage of the boosting procedure, we calculate a negative
gradient on n training points $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ (Lines 3-4). Then, we fit a simple model (such as
a shallow decision tree) to approximate the gradient (Line 5) for any feature $\boldsymbol{x}$. Finally,
similar to the gradient descent method, we make a $\gamma$-sized step in the direction of the
negative gradient (Line 6).
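
The statement that the negative gradient of the squared-error loss equals twice the residual
can be checked with a finite-difference approximation. The sketch below is only a numerical
illustration: the responses `y`, the current predictions `g_prev` (standing for the values
$g_{b-1}(\boldsymbol{x}_i)$), and the step `h` are assumed values chosen for the example.

```python
import numpy as np

def squared_error(y, z):
    return (y - z) ** 2

y = np.array([1.0, -2.0, 0.5])         # illustrative responses
g_prev = np.array([0.2, -1.0, 1.5])    # illustrative values of g_{b-1}(x_i)
h = 1e-6                               # finite-difference step

# central difference approximation of -dLoss(y_i, z)/dz at z = g_{b-1}(x_i)
neg_grad = -(squared_error(y, g_prev + h) - squared_error(y, g_prev - h)) / (2 * h)
residual = y - g_prev
print(neg_grad)
print(2 * residual)
```

Both printed vectors agree up to rounding error.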


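Algorithm 8.7.2: Gradient Boosting
Input: Training set $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$, the number of boosting rounds $B$, a
       differentiable loss function $\mathrm{Loss}(y, \hat{y})$, and a gradient step-size parameter $\gamma$.
Output: Gradient boosted prediction function.
1 Set $g_0(\boldsymbol{x}) \leftarrow 0$.
2 for b = 1 to B do
3     for i = 1 to n do
4         Evaluate the negative gradient of the loss at $(\boldsymbol{x}_i, y_i)$ via

          $$r_i^{(b)} \leftarrow -\frac{\partial\, \mathrm{Loss}\,(y_i, z)}{\partial z}\bigg|_{z = g_{b-1}(\boldsymbol{x}_i)}, \quad i = 1, \ldots, n.$$

5     Approximate the negative gradient by solving

          $$h_b = \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left( r_i^{(b)} - h(\boldsymbol{x}_i) \right)^2. \tag{8.23}$$

6     Set $g_b(\boldsymbol{x}) \leftarrow g_{b-1}(\boldsymbol{x}) + \gamma\, h_b(\boldsymbol{x})$.
7 return $g_B$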
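
The steps of Algorithm 8.7.2 can be sketched for a one-dimensional feature as follows. This
is a minimal illustration under stated assumptions: the class $\mathcal{H}$ is taken to be depth-1
regression trees fitted by least squares, the function names (`fit_stump`, `predict_stump`,
`gradient_boost`, `boost_predict`) are hypothetical, and the log-cosh loss is an illustrative
choice of a differentiable loss that does not appear in the text.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares fit of a depth-1 tree to targets r on a one-dimensional feature x."""
    best = (np.inf, -np.inf, r.mean(), r.mean())      # fallback: constant prediction
    for xi in np.unique(x)[:-1]:                      # candidate thresholds
        left = x <= xi
        c_left, c_right = r[left].mean(), r[~left].mean()
        sse = ((r[left] - c_left) ** 2).sum() + ((r[~left] - c_right) ** 2).sum()
        if sse < best[0]:
            best = (sse, xi, c_left, c_right)
    return best[1:]          # (threshold, left value, right value)

def predict_stump(stump, x):
    xi, c_left, c_right = stump
    return np.where(x <= xi, c_left, c_right)

def gradient_boost(x, y, neg_grad, B, gamma):
    g = np.zeros(len(y))                              # Line 1: g_0 = 0
    stumps = []
    for b in range(B):
        r = neg_grad(y, g)                            # Lines 3-4: negative gradient
        stump = fit_stump(x, r)                       # Line 5: approximate it
        g = g + gamma * predict_stump(stump, x)       # Line 6: gamma-sized step
        stumps.append(stump)
    return stumps

def boost_predict(stumps, gamma, x):
    return gamma * sum(predict_stump(s, x) for s in stumps)

# negative gradients -dLoss(y, z)/dz of two differentiable losses
neg_gradients = {
    "squared error": lambda y, z: 2 * (y - z),        # Loss = (y - z)^2
    "log-cosh": lambda y, z: np.tanh(y - z),          # Loss = log(cosh(y - z))
}

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)                             # illustrative data
for name, neg_grad in neg_gradients.items():
    stumps = gradient_boost(x, y, neg_grad, B=100, gamma=0.1)
    mse = np.mean((y - boost_predict(stumps, 0.1, x)) ** 2)
    print(name, "training MSE:", mse)
```

Only the function passed as `neg_grad` changes between losses; the fitting and updating steps
stay the same.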


Example 8.6 (Gradient Boosting for a Regression Tree) Let us continue with the basic
bagging and random forest examples for a regression tree (Examples 8.3 and 8.4), where
we compared the standard decision tree estimator with the corresponding bagging and random
forest estimators. Now, we use the gradient boosting estimator from Algorithm 8.7.2,
as implemented in sklearn. We use $\gamma = 0.1$ and perform B = 100 boosting rounds. As
a prediction function $h_b$ for $b = 1, \ldots, B$ we use small regression trees of depth at most
3. Note that such individual trees do not usually give good performance; that is, they are
weak prediction functions. We can see that the resulting boosting prediction function gives
an $R^2$ score of 0.899, which is better than the $R^2$ scores of the simple decision tree (0.5754),
the bagged tree (0.761), and the random forest (0.8106).


GradientBoostingRegression.py

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# create regression problem
n_points = 1000 # points
x, y = make_friedman1(n_samples=n_points, n_features=15,
                      noise=1.0, random_state=100)

# split to train/test set
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.33, random_state=100)

# boosting sklearn
from sklearn.ensemble import GradientBoostingRegressor

breg = GradientBoostingRegressor(learning_rate=0.1,
          n_estimators=100, max_depth=3, random_state=100)
breg.fit(x_train, y_train)
yhat = breg.predict(x_test)
print("Gradient Boosting R^2 score = ", r2_score(y_test, yhat))


Gradient Boosting R^2 score = 0.8993055635639531





We proceed with the classification setting and consider the original boosting algorithm:
AdaBoost. The inventors of the AdaBoost method considered a binary classification problem,
where the response variable belongs to the $\{-1, 1\}$ set. The idea of AdaBoost is similar
to the one presented in the regression setting, that is, AdaBoost fits a sequence of prediction
functions $g_0$, $g_1 = g_0 + h_1$, $g_2 = g_0 + h_1 + h_2, \ldots$ with final prediction function

$$g_B(\boldsymbol{x}) = g_0(\boldsymbol{x}) + \sum_{b=1}^{B} h_b(\boldsymbol{x}), \tag{8.24}$$
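where each function $h_b$ is of the form $h_b(\boldsymbol{x}) = \alpha_b\, c_b(\boldsymbol{x})$, with $\alpha_b \in \mathbb{R}_+$, and where $c_b$ is a
proper (but weak) classifier in some class $\mathcal{C}$. Thus, $c_b(\boldsymbol{x}) \in \{-1, 1\}$. Exactly as in (8.20), we
solve at each boosting iteration the optimization problem

$$(\alpha_b, c_b) = \operatorname*{argmin}_{\alpha \geq 0,\, c \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\,(y_i, g_{b-1}(\boldsymbol{x}_i) + \alpha\, c(\boldsymbol{x}_i)). \tag{8.25}$$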


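However, in this case the loss function is defined as $\mathrm{Loss}(y, \hat{y}) = e^{-y \hat{y}}$. The algorithm starts
with a simple model $g_0 := 0$ and for each successive iteration $b = 1, \ldots, B$ solves (8.25).
Thus,

$$(\alpha_b, c_b) = \operatorname*{argmin}_{\alpha \geq 0,\, c \in \mathcal{C}} \sum_{i=1}^{n} e^{-y_i g_{b-1}(\boldsymbol{x}_i)}\, e^{-y_i \alpha\, c(\boldsymbol{x}_i)} = \operatorname*{argmin}_{\alpha \geq 0,\, c \in \mathcal{C}} \sum_{i=1}^{n} w_i^{(b)}\, e^{-y_i \alpha\, c(\boldsymbol{x}_i)},$$

where $w_i^{(b)} := \exp\{-y_i\, g_{b-1}(\boldsymbol{x}_i)\}$ does not depend on $\alpha$ or $c$. It follows that

$$\begin{aligned}
(\alpha_b, c_b) &= \operatorname*{argmin}_{\alpha \geq 0,\, c \in \mathcal{C}} \; e^{-\alpha} \sum_{i=1}^{n} w_i^{(b)}\, \mathbb{1}\{c(\boldsymbol{x}_i) = y_i\} + e^{\alpha} \sum_{i=1}^{n} w_i^{(b)}\, \mathbb{1}\{c(\boldsymbol{x}_i) \neq y_i\} \\
&= \operatorname*{argmin}_{\alpha \geq 0,\, c \in \mathcal{C}} \; (e^{\alpha} - e^{-\alpha})\, \ell_\tau^{(b)}(c) + e^{-\alpha},
\end{aligned} \tag{8.26}$$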







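where

$$\ell_\tau^{(b)}(c) = \frac{\sum_{i=1}^{n} w_i^{(b)}\, \mathbb{1}\{c(\boldsymbol{x}_i) \neq y_i\}}{\sum_{i=1}^{n} w_i^{(b)}}$$

can be interpreted as the weighted zero-one training loss at iteration b.

For any $\alpha > 0$, the program (8.26) is minimized by a classifier $c \in \mathcal{C}$ that minimizes
this weighted training loss; that is,

$$c_b(\boldsymbol{x}) = \operatorname*{argmin}_{c \in \mathcal{C}} \ell_\tau^{(b)}(c). \tag{8.27}$$

Substituting (8.27) into (8.26) and solving for the optimal $\alpha$ gives

$$\alpha_b = \frac{1}{2} \ln\left( \frac{1 - \ell_\tau^{(b)}(c_b)}{\ell_\tau^{(b)}(c_b)} \right). \tag{8.28}$$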
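
The closed-form step size can be compared with a direct numerical minimization of the
objective $(e^{\alpha} - e^{-\alpha})\,\ell + e^{-\alpha}$ over $\alpha \geq 0$. The sketch below is a sanity check
only: the value `ell = 0.2` of the weighted zero-one loss and the search grid are assumptions
made for illustration.

```python
import numpy as np

def objective(alpha, ell):
    """Objective of (8.26) as a function of alpha for a fixed weighted loss ell."""
    return (np.exp(alpha) - np.exp(-alpha)) * ell + np.exp(-alpha)

ell = 0.2                                       # illustrative weighted zero-one loss
alpha_closed = 0.5 * np.log((1 - ell) / ell)    # closed form (8.28)

alphas = np.linspace(0.0, 3.0, 30001)           # grid search over alpha >= 0
alpha_grid = alphas[np.argmin(objective(alphas, ell))]
print(alpha_closed, alpha_grid)
```

The grid minimizer agrees with the closed form up to the grid resolution.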


This gives the AdaBoost algorithm, summarized below. 


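Algorithm 8.7.3: AdaBoost
Input: Training set $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$, and the number of boosting rounds $B$.
Output: AdaBoost prediction function.
1 Set $g_0(\boldsymbol{x}) \leftarrow 0$.
2 for i = 1 to n do
3     $w_i^{(1)} \leftarrow 1/n$
4 for b = 1 to B do
5     Fit a classifier $c_b$ on the training set $\tau$ by solving

      $$c_b = \operatorname*{argmin}_{c \in \mathcal{C}} \ell_\tau^{(b)}(c) = \operatorname*{argmin}_{c \in \mathcal{C}} \frac{\sum_{i=1}^{n} w_i^{(b)}\, \mathbb{1}\{c(\boldsymbol{x}_i) \neq y_i\}}{\sum_{i=1}^{n} w_i^{(b)}}.$$

6     Set $\alpha_b \leftarrow \frac{1}{2} \ln\left( \frac{1 - \ell_\tau^{(b)}(c_b)}{\ell_\tau^{(b)}(c_b)} \right)$.
7     for i = 1 to n do    // Update weights
8         $w_i^{(b+1)} \leftarrow w_i^{(b)} \exp\{-y_i\, \alpha_b\, c_b(\boldsymbol{x}_i)\}$.
9 return $g_B(\boldsymbol{x}) := \sum_{b=1}^{B} \alpha_b\, c_b(\boldsymbol{x})$.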


Algorithm 8.7.3 is quite intuitive. At the first step (b = 1), AdaBoost assigns an equal
weight $w_i^{(1)} = 1/n$ to each training sample $(\boldsymbol{x}_i, y_i)$ in the set $\tau = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$. Note that, in
this case, the weighted zero-one training loss is equal to the regular zero-one training loss.
At each successive step b > 1, the weights of observations that were incorrectly classified
by the previous boosting prediction function $g_b$ are increased, and the weights of correctly
classified observations are decreased. Due to the use of the weighted zero-one loss, the set
of incorrectly classified training samples will receive an extra weight and thus have a better
chance of being classified correctly by the next classifier $c_{b+1}$. As soon as the AdaBoost
algorithm finds the prediction function $g_B$, the final classification is delivered via

$$\operatorname{sign}\left[ \sum_{b=1}^{B} \alpha_b\, c_b(\boldsymbol{x}) \right].$$

The step-size parameter $\alpha_b$ found by the AdaBoost algorithm in Line 6 can be
viewed as an optimal step-size in the sense of training loss minimization. However,
similar to the regression setting, one can slow down the AdaBoost algorithm
by setting $\alpha_b$ to be a fixed (small) value $\alpha_b = \gamma$. As usual, when this is done in
practice, it helps to counter overfitting.
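
A single reweighting round of Algorithm 8.7.3 can be traced by hand. The sketch below uses
five hypothetical training labels and a hypothetical classifier output that misclassifies one
of them. It computes the weighted zero-one loss, the step size $\alpha_b$, and the updated weights.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])         # illustrative labels in {-1, 1}
pred = np.array([1, -1, -1, -1, 1])     # illustrative classifier output; the second sample is wrong
w = np.ones(len(y)) / len(y)            # initial weights 1/n

ell = np.sum(w * (pred != y)) / np.sum(w)    # weighted zero-one training loss
alpha = 0.5 * np.log((1 - ell) / ell)        # step size alpha_b
w_new = w * np.exp(-y * alpha * pred)        # weight update
print(ell, alpha)
print(w_new / w)
```

In this example the weight of the misclassified sample doubles while the other weights are
halved, so that the misclassified and the correctly classified samples carry the same total
weight in the next round.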





We consider an implementation of Algorithm 8.7.3 for a binary classification problem. 
Specifically, during all boosting rounds, we use simple decision trees of depth 1 (also called 
decision tree stumps) as weak learners. The exponential and zero—one training losses as a 
function of the number of boosting rounds are presented in Figure 8.11. 


AdaBoost.py

from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss
import numpy as np

def ExponentialLoss(y, yhat):
    n = len(y)
    loss = 0
    for i in range(n):
        loss = loss + np.exp(-y[i]*yhat[i])
    loss = loss/n
    return loss

# create binary classification problem
np.random.seed(100)

n_points = 100 # points
x, y = make_blobs(n_samples=n_points, n_features=5, centers=2,
                  cluster_std=20.0, random_state=100)
y[y==0] = -1   # relabel the classes as -1 and 1

# AdaBoost implementation
BoostingRounds = 1000
n = len(x)
W = 1/n*np.ones(n)   # initial weights 1/n

Learner = []
alpha_b_arr = []

for b in range(BoostingRounds):
    # fit a decision tree stump to the weighted training data
    clf = DecisionTreeClassifier(max_depth=1)
    clf.fit(x, y, sample_weight=W)
    Learner.append(clf)

    # weighted zero-one training loss of the stump
    train_pred = clf.predict(x)
    err_b = 0
    for i in range(n):
        if train_pred[i] != y[i]:
            err_b = err_b + W[i]
    err_b = err_b/np.sum(W)

    alpha_b = 0.5*np.log((1-err_b)/err_b)
    alpha_b_arr.append(alpha_b)

    # update weights
    for i in range(n):
        W[i] = W[i]*np.exp(-y[i]*alpha_b*train_pred[i])

yhat_boost = np.zeros(len(y))
for j in range(BoostingRounds):
    yhat_boost = yhat_boost + alpha_b_arr[j]*Learner[j].predict(x)

yhat = np.zeros(n)
yhat[yhat_boost>=0] = 1
yhat[yhat_boost<0] = -1
print("AdaBoost Classifier exponential loss = ", ExponentialLoss(y, yhat_boost))
print("AdaBoost Classifier zero--one loss = ", zero_one_loss(y, yhat))


AdaBoost Classifier exponential loss = 0.004224013663777142
AdaBoost Classifier zero--one loss = 0.0

Figure 8.11: Exponential and zero-one training loss as a function of the number of boosting
rounds B for a binary classification problem.

Further Reading


Breiman’s book on decision trees, [20], serves as a great starting point. Some additional
advances can be found in [62, 96]. From the computational point of view, there exists
an efficient recursive procedure for tree pruning; see Chapters 3 and 10 in [20]. Several
advantages and disadvantages of using decision trees are debated in [37, 55]. A detailed
discussion on bagging and random forests can be found in [21] and [23], respectively.
Freund and Schapire [44] provide the first boosting algorithm, AdaBoost. While AdaBoost
was developed in the context of the computational complexity of learning, it was
later discovered by Friedman [45] that AdaBoost is a special case of an additive model.
In addition, it was shown that for any differentiable loss function, there exists an efficient
boosting procedure which mimics the gradient descent algorithm. The foundation of the
resulting gradient boosting method is detailed in [45]. Python packages that implement
gradient boosting include XGBoost and LightGBM.


Exercises 


1. Show that any training set $\tau = \{(\boldsymbol{x}_i, y_i), i = 1, \ldots, n\}$ can be fitted via a tree with zero
training loss.

2. Suppose during the construction of a decision tree we wish to specify a constant regional
prediction function $g^w$ on the region $\mathcal{R}_w$, based on the training data in $\mathcal{R}_w$, say
$\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_k, y_k)\}$. Show that $g^w(\boldsymbol{x}) := k^{-1} \sum_{i=1}^{k} y_i$ minimizes the squared-error loss.


3. Using the program from Section 8.2.4, write a basic implementation of a decision tree 
for a binary classification problem. Implement the misclassification, Gini index, and en- 
tropy impurity criteria to split nodes. Compare the results. 


4. Suppose in the decision tree of Example 8.1, there are 3 blue and 2 red data points in 
a certain tree region. Calculate the misclassification impurity, the Gini impurity, and the 
entropy impurity. Repeat these calculations for 2 blue and 3 red data points. 


5. Consider the procedure of finding the best splitting rule for a categorical variable with
k labels from Section 8.3.4. Show that one needs to consider $2^k$ subsets of $\{1, \ldots, k\}$ to find
the optimal partition of labels.


6. Reproduce Figure 8.6 using the following classification data. 


from sklearn.datasets import make_blobs 


X, y = make_blobs(n_samples=5000, n_features=10, centers=3, 
random_state=10, cluster_std=10) 





7. Prove (8.13); that is, show that

$$\sum_{w \in \mathcal{W}} \left( \sum_{i=1}^{n} \mathbb{1}\{\boldsymbol{x}_i \in \mathcal{R}_w\}\, \mathrm{Loss}\,(y_i, g^w(\boldsymbol{x}_i)) \right) = n\, \ell_\tau(g).$$

8. Suppose $\tau$ is a training set with n elements and $\mathcal{T}^*$, also of size n, is obtained from $\tau$
by bootstrapping; that is, resampling with replacement. Show that for large n, $\mathcal{T}^*$ does not
contain a fraction of about $e^{-1} \approx 0.37$ of the points from $\tau$.


9. Prove Equation (8.17). 


10. Consider the following training/test split of the data. Construct a random forest
regressor and identify the optimal subset size m in the sense of $R^2$ score (see Remark 8.3).

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# create regression problem
n_points = 1000 # points
x, y = make_friedman1(n_samples=n_points, n_features=15,
                      noise=1.0, random_state=100)

# split to train/test set
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.33, random_state=100)





11. Explain why bagging decision trees is a special case of random forests.
12. Show that (8.28) holds. 
13. Consider the following classification data and module imports: 


from sklearn.datasets import make_blobs 

from sklearn.metrics import zero_one_loss 

from sklearn.model_selection import train_test_split 
import numpy as np 

import matplotlib.pyplot as plt 

from sklearn.ensemble import GradientBoostingClassifier 


X_train, y_train = make_blobs(n_samples=5000, n_features=10, 
centers=3, random_state=10, cluster_std=5) 





Using the gradient boosting algorithm with B = 100 rounds, plot the training loss as a
function of $\gamma$, for $\gamma = 0.1, 0.3, 0.5, 0.7, 1$. What is your conclusion regarding the relation
between B and $\gamma$?


CHAPTER 9 





DEEP LEARNING 





In this chapter, we show how one can construct a rich class of approximating func- 
tions called neural networks. The learners belonging to the neural-network class of 
functions have attractive properties that have made them ubiquitous in modern machine 
learning applications — their training is computationally feasible and their complexity 
is easy to control and fine-tune. 


9.1 Introduction 


In Chapter 2 we described the basic supervised learning task; namely, we wish to predict a
random output $Y$ from a random input $\boldsymbol{X}$, using a prediction function $g : \boldsymbol{x} \mapsto y$ that belongs
to a suitably chosen class of approximating functions $\mathcal{G}$. More generally, we may wish to
predict a vector-valued output $\boldsymbol{y}$ using a prediction function $\boldsymbol{g} : \boldsymbol{x} \mapsto \boldsymbol{y}$ from class $\mathcal{G}$.

In this chapter $\boldsymbol{y}$ denotes the vector-valued output for a given input $\boldsymbol{x}$. This differs
from our previous use (e.g., in Table 2.1), where $\boldsymbol{y}$ denotes a vector of scalar outputs.





In the machine learning context, the class G is sometimes referred to as the hypothesis 
space or the universe of possible models, and the representational capacity of a hypothesis 
space G is simply its complexity. 

Suppose that we have a class of functions $\mathcal{G}_L$, indexed by a parameter L that controls
the complexity of the class, so that $\mathcal{G}_L \subset \mathcal{G}_{L+1} \subset \mathcal{G}_{L+2} \subset \cdots$. In selecting a suitable
class of functions, we have to be mindful of the approximation-estimation tradeoff. On the
one hand, the class $\mathcal{G}_L$ must be complex (rich) enough to accurately represent the optimal
unknown prediction function $g^*$, which may require a very large L. On the other hand, the
learners in the class $\mathcal{G}_L$ must be simple enough to train with small estimation error and
with minimal demands on computer memory, which may necessitate a small L.

In balancing these competing objectives, it helps if the more complex class $\mathcal{G}_{L+1}$ is
easily constructed from an already existing and simpler $\mathcal{G}_L$. The simpler class of functions
$\mathcal{G}_L$ may itself be constructed by modifying an even simpler class $\mathcal{G}_{L-1}$, and so on.

A class of functions that permits such a natural hierarchical construction is the class of
neural networks. Conceptually, a neural network with L layers is a nonlinear parametric
regression model whose representational capacity can easily be controlled by L.

Alternatively, in (9.3) we will define the output of a neural network as the repeated
composition of linear and (componentwise) nonlinear functions. As we shall see, this
representation of the output will provide a flexible class of nonlinear functions that can be
easily differentiated. As a result, the training of learners via gradient optimization methods
involves mostly standard matrix operations that can be performed very efficiently.

Historically, neural networks were originally intended to mimic the workings of the 
human brain, with the network nodes modeling neurons and the network links modeling 
the axons connecting neurons. For this reason, rather than using the terminology of the 
regression models in Chapter 5, we prefer to use a nomenclature inspired by the apparent 
resemblance of neural networks to structures in the human brain. 

We note, however, that the attempts at building efficient machine learning algorithms by 
mimicking the functioning of the human brain have been as unsuccessful as the attempts 
at building flying aircraft by mimicking the flapping of birds’ wings. Instead, many ef- 
fective machine algorithms have been inspired by age-old mathematical ideas for function 
approximation. One such idea is the following fundamental result (see [119] for a proof). 


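Theorem 9.1: Kolmogorov (1957)

Every continuous function $g^* : [0, 1]^p \to \mathbb{R}$ with $p \geq 2$ can be written as

$$g^*(\boldsymbol{x}) = \sum_{j=1}^{2p+1} h_j\!\left( \sum_{i=1}^{p} h_{ij}(x_i) \right),$$

where $\{h_j, h_{ij}\}$ is a set of univariate continuous functions that depend on $g^*$.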





This result tells us that any continuous high-dimensional map can be represented as
the function composition of much simpler (one-dimensional) maps. The composition of
the maps needed to compute the output $g^*(\boldsymbol{x})$ for a given input $\boldsymbol{x} \in \mathbb{R}^p$ is depicted in
Figure 9.1, showing a directed graph or neural network with three layers, denoted as
$l = 0, 1, 2$.








Figure 9.1: Every continuous function $g^* : [0, 1]^p \to \mathbb{R}$ can be represented by a neural
network with one hidden layer ($l = 1$), an input layer ($l = 0$), and an output layer ($l = 2$).

In particular, each of the p components of the input $\boldsymbol{x}$ is represented as a node in the
input layer ($l = 0$). In the hidden layer ($l = 1$) there are $q := 2p + 1$ nodes, each of which
is associated with a pair of variables $(z, a)$ with values

$$z_j = \sum_{i=1}^{p} h_{ij}(x_i) \quad \text{and} \quad a_j = h_j(z_j).$$

A link between nodes $(z_j, a_j)$ and $x_i$ with weight $h_{ij}$ signifies that the value of $z_j$ depends
on the value of $x_i$ via the function $h_{ij}$. Finally, the output layer ($l = 2$) represents the value
$g^*(\boldsymbol{x}) = \sum_{j=1}^{q} a_j$. Note that the arrows on the graph remind us that the sequence of the
computations is executed from left to right, or from the input layer $l = 0$ through to the
output layer $l = 2$.
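
The order of the computations in this three-layer network can be sketched as follows. The
one-dimensional maps `h_inner` and `h_outer` are placeholders invented for the illustration;
they are assumptions and not the functions whose existence the theorem guarantees. The input
dimension `p = 2` is likewise an arbitrary choice.

```python
import numpy as np

p = 2                 # input dimension (illustrative)
q = 2 * p + 1         # number of hidden nodes

def h_inner(i, j, t):
    """Placeholder for h_ij (an assumption, not the theorem's function)."""
    return np.sin((j + 1) * t + i)

def h_outer(j, t):
    """Placeholder for h_j (an assumption, not the theorem's function)."""
    return np.tanh(t) / (j + 1)

def network_output(x):
    # hidden layer: z_j = sum_i h_ij(x_i) and a_j = h_j(z_j)
    z = np.array([sum(h_inner(i, j, x[i]) for i in range(p)) for j in range(q)])
    a = np.array([h_outer(j, z[j]) for j in range(q)])
    # output layer: sum of the a_j
    return z, a, a.sum()

z, a, g = network_output(np.array([0.3, 0.7]))
print(z.shape, a.shape, g)
```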

In practice, we do not know the collection of functions $\{h_j, h_{ij}\}$, because they depend
on the unknown $g^*$. In the unlikely event that $g^*$ is linear, all of the $(2p + 1)(p + 1)$
one-dimensional functions will be linear as well. However, in general, we should expect
that each of the functions in $\{h_j, h_{ij}\}$ is nonlinear.

Unfortunately, Theorem 9.1 only asserts the existence of $\{h_j, h_{ij}\}$, and does not tell us
how to construct these nonlinear functions. One way out of this predicament is to replace
these $(2p + 1)(p + 1)$ unknown functions with a much larger number of known nonlinear
functions called activation functions. For example, a logistic activation function is

$$S(z) = (1 + \exp(-z))^{-1}.$$


We then hope that such a network, built from a sufficiently large number of activation 
functions, will have similar representational capacity as the neural network in Figure 9.1 
with (2p + 1)(p + 1) functions. 

In general, we wish to use the simplest activation functions that will allow us to build 
a learner with large representational capacity and low training cost. The logistic function 
is merely one possible choice for an activation function from among infinite possibilit- 
ies. Figure 9.2 shows a small selection of activation functions with different regularity or 
smoothness properties. 





[Figure 9.2 panels, each plotted for z between -2 and 2: Heaviside or unit step, $\mathbb{1}\{z > 0\}$;
rectified linear unit (ReLU), $z \times \mathbb{1}\{z > 0\}$; logistic, $(1 + \exp(-z))^{-1}$.]

Figure 9.2: Some common activation functions $S(z)$ with their defining formulas and plots.
The logistic function is an example of a sigmoid (that is, an S-shaped) function. Some
books define the logistic function as $2S(z) - 1$ (in terms of our definition).
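
The three activation functions of Figure 9.2 are straightforward to write down in code. The
following sketch uses NumPy and evaluates them on a small grid; the function names and the
grid are illustrative choices.

```python
import numpy as np

def heaviside(z):
    """Heaviside or unit step: 1{z > 0}."""
    return (z > 0).astype(float)

def relu(z):
    """Rectified linear unit: z * 1{z > 0}."""
    return z * (z > 0)

def logistic(z):
    """Logistic function: (1 + exp(-z))^(-1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-2, 2, 5)
for S in (heaviside, relu, logistic):
    print(S.__name__, S(z))
```

The alternative definition mentioned in the caption, $2S(z) - 1$, coincides with $\tanh(z/2)$.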





¹ Activation functions derive their name from models of a neuron’s response when exposed to chemical
or electric stimuli.

In addition to choosing the type and number of activation functions in a neural network,
we can improve its representational capacity in another important way: introduce more
hidden layers. In the next section we explore this possibility in detail.


9.2 Feed-Forward Neural Networks 


In a neural network with $L + 1$ layers, the zero or input layer ($l = 0$) encodes the input feature
vector $\boldsymbol{x}$, and the last or output layer ($l = L$) encodes the (multivalued) output function $\boldsymbol{g}(\boldsymbol{x})$.
The remaining layers are called hidden layers. Each layer has a number of nodes, say $p_l$
nodes for layer $l = 0, \ldots, L$. In this notation, $p_0$ is the dimension of the input feature vector
$\boldsymbol{x}$ and, for example, $p_L = 1$ signifies that $g(\boldsymbol{x})$ is a scalar output. All nodes in the hidden
layers ($l = 1, \ldots, L - 1$) are associated with a pair of variables $(z, a)$, which we gather
into $p_l$-dimensional column vectors $\boldsymbol{z}_l$ and $\boldsymbol{a}_l$. In the so-called feed-forward networks, the
variables in any layer $l$ are simple functions of the variables in the preceding layer $l - 1$. In
particular, $\boldsymbol{z}_l$ and $\boldsymbol{a}_{l-1}$ are related via the linear relation $\boldsymbol{z}_l = \mathbf{W}_l\, \boldsymbol{a}_{l-1} + \boldsymbol{b}_l$, for some weight
matrix $\mathbf{W}_l$ and bias vector $\boldsymbol{b}_l$.

Within any hidden layer $l = 1,\ldots,L-1$, the components of the vectors $z_l$ and $a_l$
are related via $a_l = S_l(z_l)$, where $S_l : \mathbb{R}^{p_l} \to \mathbb{R}^{p_l}$ is a nonlinear multivalued function. All of
these multivalued functions are typically of the form

$$S_l(z) = [S(z_1), \ldots, S(z_{\dim(z)})]^\top, \quad l = 1,\ldots,L-1, \qquad (9.1)$$

where $S$ is an activation function common to all hidden layers. The function $S_L : \mathbb{R}^{p_L} \to \mathbb{R}^{p_L}$
in the output layer is more general and its specification depends, for example, on
whether the network is used for classification or for the prediction of a continuous output
$Y$. A four-layer ($L = 3$) network is illustrated in Figure 9.3.


[Figure 9.3: network diagram showing the input layer, the hidden layers (with bias nodes), and the output layer.]

Figure 9.3: A neural network with $L = 3$: the $l = 0$ layer is the input layer, followed by two
hidden layers, and the output layer. Hidden layers may have different numbers of nodes.

The output of this neural network is determined by the input vector $x$, (nonlinear)
functions $\{S_l\}$, as well as weight matrices $W_l = [w_{l,ij}]$ and bias vectors $b_l = [b_{l,j}]$ for
$l = 1, 2, 3$.


Here, the $(i, j)$-th element of the weight matrix $W_l = [w_{l,ij}]$ is the weight that connects
the $j$-th node in the $l$-th layer with the $i$-th node in the $(l+1)$-st layer.





The name given to $L$ (the number of layers without the input layer) is the network depth
and $\max_l p_l$ is called the network width. While we mostly study networks that have an equal
number of nodes in the hidden layers ($p_1 = \cdots = p_{L-1}$), in general there can be different
numbers of nodes in each hidden layer.

The output g(x) of a multiple-layer neural network is obtained from the input x via the 
following sequence of computations: 


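$$\underbrace{x}_{a_0} \to \underbrace{W_1 a_0 + b_1}_{z_1} \to \underbrace{S_1(z_1)}_{a_1} \to \underbrace{W_2 a_1 + b_2}_{z_2} \to \underbrace{S_2(z_2)}_{a_2} \to \cdots \to \underbrace{W_L a_{L-1} + b_L}_{z_L} \to \underbrace{S_L(z_L)}_{a_L} = g(x). \qquad (9.2)$$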


Denoting the function $z \mapsto W_l z + b_l$ by $M_l$, the output $g(x)$ can thus be written as the
function composition

$$g(x) = S_L \circ M_L \circ \cdots \circ S_2 \circ M_2 \circ S_1 \circ M_1(x). \qquad (9.3)$$

The algorithm for computing the output $g(x)$ for an input $x$ is summarized next. Note
that we leave open the possibility that the activation functions $\{S_l\}$ have different definitions
for each layer. In some cases, $S_l$ may even depend on some or all of the already computed
$z_1, z_2, \ldots$ and $a_1, a_2, \ldots$.


Algorithm 9.2.1: Feed-Forward Propagation for a Neural Network
input: Feature vector $x$; weights $\{w_{l,ij}\}$, biases $\{b_{l,i}\}$ for each layer $l = 1,\ldots,L$.
output: The value of the prediction function $g(x)$.
1 $a_0 \leftarrow x$ // the zero or input layer
2 for $l = 1$ to $L$ do
3     Compute the hidden variable $z_{l,i}$ for each node $i$ in layer $l$:
          $z_l \leftarrow W_l a_{l-1} + b_l$
4     Compute the activation function $a_{l,i}$ for each node $i$ in layer $l$:
          $a_l \leftarrow S_l(z_l)$
5 return $g(x) \leftarrow a_L$ // the output layer


Example 9.1 (Nonlinear Multi-Output Regression) Given the input $x \in \mathbb{R}^{p_0}$ and an
activation function $S : \mathbb{R} \to \mathbb{R}$, the output $g(x) := [g_1(x),\ldots,g_{p_2}(x)]^\top$ of a nonlinear multi-
output regression model can be computed via a neural network with:

$$z_1 = W_1 x + b_1, \quad \text{where } W_1 \in \mathbb{R}^{p_1 \times p_0},\ b_1 \in \mathbb{R}^{p_1},$$
$$a_{1,k} = S(z_{1,k}), \quad k = 1,\ldots,p_1,$$
$$g(x) = W_2 a_1 + b_2, \quad \text{where } W_2 \in \mathbb{R}^{p_2 \times p_1},\ b_2 \in \mathbb{R}^{p_2},$$

which is a neural network with one hidden layer and output function $S_2(z) = z$. In the
special case where $p_1 = p_2 = 1$, $b_2 = 0$, $W_2 = 1$, and we collect all parameters into the
vector $\theta^\top = [b_1, W_1] \in \mathbb{R}^{p_0+1}$, the neural network can be interpreted as a generalized linear
model with $\mathbb{E}[Y \mid X = x] = h([1, x^\top]\,\theta)$ for some activation function $h$.
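Algorithm 9.2.1, applied to the one-hidden-layer network of Example 9.1, can be sketched as follows. The layer sizes, the random parameters, and the choice of a logistic hidden activation are illustrative assumptions; the lists `W`, `b`, `S` are indexed from 1 so that they line up with the layer index $l$.

```python
import numpy as np

def feed_forward(x, W, b, S):
    """Sketch of Algorithm 9.2.1; W, b, S are lists indexed 1..L (entry 0 unused)."""
    a = x                                   # a_0 = x
    for l in range(1, len(W)):
        z = W[l] @ a + b[l]                 # z_l = W_l a_{l-1} + b_l
        a = S[l](z)                         # a_l = S_l(z_l)
    return a                                # g(x) = a_L

rng = np.random.default_rng(0)
p = [3, 4, 2]                               # hypothetical sizes p_0, p_1, p_2
W = [None] + [rng.standard_normal((p[l], p[l - 1])) for l in range(1, len(p))]
b = [None] + [rng.standard_normal((p[l], 1)) for l in range(1, len(p))]

logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # hidden activation S
identity = lambda z: z                          # output function S_2(z) = z
S = [None, logistic, identity]

x = rng.standard_normal((p[0], 1))
g = feed_forward(x, W, b, S)                # a p_2-dimensional column vector
```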


Example 9.2 (Multi-Logit Classification) Suppose that, for a classification problem,
an input $x$ has to be classified into one of $c$ classes, labeled $0,\ldots,c-1$. We can perform the
classification via a neural network with one hidden layer, with $p_1 = c$ nodes. In particular,
we have

$$z_1 = W_1 x + b_1, \quad a_1 = S_1(z_1),$$

where $S_1$ is the softmax function:

$$\mathrm{softmax} : z \mapsto \frac{\exp(z)}{\sum_k \exp(z_k)}.$$

For the output, we take $g(x) = [g_1(x),\ldots,g_c(x)]^\top = a_1$, which can then be used as a
pre-classifier of $x$. The actual classifier of $x$ into one of the categories $0, 1,\ldots,c-1$ is then

$$\mathop{\mathrm{argmax}}_{k \in \{0,\ldots,c-1\}} g_{k+1}(x).$$

This is equivalent to the multi-logit classifier in Section 7.5. Note, however, that there we
used a slightly different notation for the feature vector and we have a reference class; see
Exercise 13.


In practical implementations, the softmax function can cause numerical over- and
under-flow errors when either one of the $\exp(z_k)$ happens to be extremely large or
$\sum_k \exp(z_k)$ happens to be very small. In such cases we can exploit the invariance
property (Exercise 1):

$$\mathrm{softmax}(z) = \mathrm{softmax}(z + c \times \mathbf{1}) \quad \text{for any constant } c.$$

Using this property, we can compute $\mathrm{softmax}(z)$ with greater numerical stability via
$\mathrm{softmax}(z - \max_k\{z_k\} \times \mathbf{1})$.
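The shift by $\max_k\{z_k\}$ can be sketched as below. The input values are made up to provoke overflow: evaluating `np.exp(1000.0)` directly yields `inf` in double precision, whereas the shifted version stays finite.

```python
import numpy as np

def softmax(z):
    """Softmax evaluated as softmax(z - max_k z_k * 1) for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))     # largest exponent is exp(0) = 1, so no overflow
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # illustrative values
p = softmax(z)
p_shifted = softmax(z - 1000.0)          # same result, by the invariance property
```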





When neural networks are used for classification into $c$ classes and the number of out-
put nodes is $c - 1$, then the $g_k(x)$ may be viewed as nonlinear discriminant functions.

Example 9.3 (Density Estimation) Estimating the density $f$ of some random feature
$X \in \mathbb{R}$ is the prototypical unsupervised learning task, which we tackled in Section 4.39 us-
ing Gaussian mixture models. We can view a Gaussian mixture model with $p_1$ components
and a common scale parameter $\sigma > 0$ as a neural network with two hidden layers, similar
to the one in Figure 9.3. In particular, if the activation function in the first hidden layer,
$S_1$, is of the form (9.1) with $S(z) := \exp(-z^2/(2\sigma^2))/\sqrt{2\pi\sigma^2}$, then the density value $g(x)$ is
computed via:

$$z_1 = W_1 x + b_1, \quad a_1 = S_1(z_1),$$
$$z_2 = W_2 a_1 + b_2, \quad a_2 = S_2(z_2),$$
$$g(x) = a_1^\top a_2,$$

where $W_1 = \mathbf{1}$ is a $p_1 \times 1$ column vector of ones, $W_2 = \mathbf{O}$ is a $p_1 \times p_1$ matrix of zeros, and $S_2$
is the softmax function. We identify the column vector $b_1$ with the $p_1$ location parameters,
$[\mu_1, \ldots, \mu_{p_1}]^\top$ of the Gaussian mixture and $b_2 \in \mathbb{R}^{p_1}$ with the $p_1$ weights of the mixture.
Note the unusual activation function of the output layer: it requires the value of $a_1$ from
the first hidden layer and $a_2$ from the second hidden layer.
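The network of Example 9.3 can be sketched directly from its three defining equations. The values of `b1`, `b2`, and `sigma` are arbitrary illustrative choices. Note that, with $z_1 = \mathbf{1}x + b_1$ taken literally, the $k$-th Gaussian bump peaks at $x = -b_{1,k}$, and the mixture weights are the softmax of $b_2$.

```python
import numpy as np

def gmm_density_network(x, b1, b2, sigma):
    """Density value g(x) = a_1^T a_2 for a scalar input x (sketch of Example 9.3)."""
    p1 = len(b1)
    W1 = np.ones(p1)                          # W_1 = 1 (vector of ones)
    z1 = W1 * x + b1                          # z_1 = W_1 x + b_1
    a1 = np.exp(-z1**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    z2 = np.zeros((p1, p1)) @ a1 + b2         # W_2 = O, hence z_2 = b_2
    e = np.exp(z2 - z2.max())
    a2 = e / e.sum()                          # S_2 = softmax: mixture weights
    return a1 @ a2                            # g(x) = a_1^T a_2

b1 = np.array([-1.0, 0.0, 2.0])               # hypothetical parameters
b2 = np.array([0.1, 0.5, -0.3])
sigma = 0.5

xs = np.linspace(-10.0, 10.0, 4001)
vals = np.array([gmm_density_network(x, b1, b2, sigma) for x in xs])
total_mass = vals.sum() * (xs[1] - xs[0])     # Riemann sum, should be close to 1
```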


There are a number of key design characteristics of a feed-forward network. First, we 
need to choose the activation function(s). Second, we need to choose the loss function for 
the training of the network. As we shall explain in the next section, the most common 
choices are the ReLU activation function and the cross-entropy loss. Crucially, we need 
to carefully construct the network architecture — the number of connections among the 
nodes in different layers and the overall number of layers of the network. 

For example, if the connections from one layer to the next are pruned (called sparse
connectivity) and the links share the same weight values $\{w_{l,ij}\}$ (called parameter sharing)
for all $\{(i, j) : |i - j| = 0, 1, \ldots\}$, then the weight matrices will be sparse and Toeplitz.

Intuitively, the parameter sharing and sparse connectivity can speed up the training of 
the network, because there are fewer parameters to learn, and the Toeplitz structure permits 
quick computation of the matrix-vector products in Algorithm 9.2.1. An important example 
of such a network is the convolution neural network (CNN), in which some or all of the 
network layers encode the linear operation of convolution: 


$$W_l a_{l-1} = w_l * a_{l-1},$$

where $[x * y]_i := \sum_k x_k y_{i-k+1}$. As discussed in Example A.10, a convolution matrix is a
special type of sparse Toeplitz matrix, and its action on a vector of learning parameters can
be evaluated quickly via the fast Fourier transform.

CNNs are particularly suited to image processing problems, because their convolution 
layers closely mimic the neurological properties of the visual cortex. In particular, the 
cortex partitions the visual field into many small regions and assigns a group of neurons to 
every such region. Moreover, some of these groups of neurons respond only to the presence 
of particular features (for example, edges). 

This neurological property is naturally modeled via convolution layers in the neural
network. Specifically, suppose that the input image is given by an $m_1 \times m_2$ matrix of pixels.
Now, define a $k \times k$ matrix (sometimes called a kernel, where $k$ is generally taken to be 3
or 5). Then, the convolution layer output can be calculated using the discrete convolution
of all possible $k \times k$ input matrix regions and the kernel matrix (see Example A.10). In
particular, by noting that there are $(m_1 - k + 1) \times (m_2 - k + 1)$ possible regions in the original
image, we conclude that the convolution layer output size is $(m_1 - k + 1) \times (m_2 - k + 1)$.
In practice, we frequently define several kernel matrices, giving an output layer of size
$(m_1 - k + 1) \times (m_2 - k + 1) \times (\text{the number of kernels})$. Figure 9.4 shows a $5 \times 5$ input image
and a $2 \times 2$ kernel with a $4 \times 4$ output matrix. An example of using a CNN for image
classification is given in Section 9.5.2.
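The output-size formula can be checked with a direct (and deliberately naive) implementation. The image and kernel entries are made up; flipping the kernel follows the usual definition of discrete convolution, whereas many deep learning libraries skip the flip and compute a cross-correlation instead.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Convolve every k x k region of the image with a k x k kernel (no padding)."""
    m1, m2 = image.shape
    k = kernel.shape[0]
    flipped = kernel[::-1, ::-1]               # discrete convolution flips the kernel
    out = np.empty((m1 - k + 1, m2 - k + 1))
    for i in range(m1 - k + 1):
        for j in range(m2 - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * flipped)
    return out

image = np.arange(25.0).reshape(5, 5)          # illustrative 5 x 5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # illustrative 2 x 2 kernel
out = conv2d_valid(image, kernel)              # shape (5-2+1, 5-2+1) = (4, 4)
```

For this particular image, every output entry equals `image[i+1, j+1] - image[i, j]`, which is 6.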





Figure 9.4: An example $5 \times 5$ input image and a $2 \times 2$ kernel. The kernel is applied to every
$2 \times 2$ region of the original image.


9.3 Back-Propagation 


The training of neural networks is a major challenge that requires both ingenuity and much 
experimentation. The algorithms for training neural networks with great depth are collect- 
ively referred to as deep learning methods. One of the simplest and most effective methods 
for training is via steepest descent and its variations. 

Steepest descent requires computation of the gradient with respect to all bias vectors 
and weight matrices. Given the potentially large number of parameters (weight and bias 
terms) in a neural network, we need to find an efficient method to calculate this gradient. 

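To illustrate the nature of the gradient computations, let $\theta = \{W_l, b_l\}$ be a column vec-
tor of length $\dim(\theta) = \sum_{l=1}^{L} (p_{l-1} p_l + p_l)$ that collects all the weight parameters (number-
ing $\sum_{l=1}^{L} p_{l-1} p_l$) and bias parameters (numbering $\sum_{l=1}^{L} p_l$) of a multiple-layer network with
training loss:

$$\ell_\tau(g(\cdot \mid \theta)) := \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(y_i, g(x_i \mid \theta)).$$

Writing $C_i(\theta) := \mathrm{Loss}(y_i, g(x_i \mid \theta))$ for short (using $C$ for cost), we have

$$\ell_\tau(g(\cdot \mid \theta)) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta), \qquad (9.4)$$

so that obtaining the gradient of $\ell_\tau$ requires computation of $\partial C_i/\partial\theta$ for every $i$. For ac-
tivation functions of the form (9.1), define $D_l$ as the diagonal matrix with the vector of
derivatives

$$S'(z_l) := [S'(z_{l,1}), \ldots, S'(z_{l,p_l})]^\top$$

down its main diagonal; that is,

$$D_l := \mathrm{diag}(S'(z_{l,1}), \ldots, S'(z_{l,p_l})), \quad l = 1, \ldots, L-1.$$

The following theorem provides us with the formulas needed to compute the gradient of a
typical $C_i(\theta)$.


Theorem 9.2: Gradient of Training Loss

For a given (input, output) pair $(x, y)$, let $g(x \mid \theta)$ be the output of (9.2), and let
$C(\theta) = \mathrm{Loss}(y, g(x \mid \theta))$ be an almost-everywhere differentiable loss function. Suppose
$\{z_l, a_l\}_{l=1}^{L}$ are the vectors obtained during the feed-forward propagation ($a_0 = x$,
$a_L = g(x \mid \theta)$). Then, we have for $l = 1, \ldots, L$:

$$\frac{\partial C}{\partial W_l} = \delta_l a_{l-1}^\top \quad \text{and} \quad \frac{\partial C}{\partial b_l} = \delta_l,$$

where $\delta_l := \partial C/\partial z_l$ is computed recursively for $l = L, \ldots, 2$:

$$\delta_{l-1} = D_{l-1} W_l^\top \delta_l \quad \text{with} \quad \delta_L = \frac{\partial S_L}{\partial z_L} \frac{\partial C}{\partial g}. \qquad (9.5)$$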





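Proof: The scalar value $C$ is obtained from the transitions (9.2), followed by the mapping
$g(x \mid \theta) \mapsto \mathrm{Loss}(y, g(x \mid \theta))$. Using the chain rule (see Appendix B.1.2), we have

$$\delta_L = \frac{\partial C}{\partial z_L} = \frac{\partial g(x)}{\partial z_L} \frac{\partial C}{\partial g(x)} = \frac{\partial S_L}{\partial z_L} \frac{\partial C}{\partial g}.$$

Recall that the vector/vector derivative of a linear mapping $z \mapsto Wz$ is given by $W^\top$; see
(B.5). It follows that, since $z_l = W_l a_{l-1} + b_l$ and $a_l = S(z_l)$, the chain rule gives

$$\frac{\partial z_l}{\partial z_{l-1}} = \frac{\partial a_{l-1}}{\partial z_{l-1}} \frac{\partial z_l}{\partial a_{l-1}} = D_{l-1} W_l^\top.$$

Hence, the recursive formula (9.5):

$$\delta_{l-1} = \frac{\partial C}{\partial z_{l-1}} = \frac{\partial z_l}{\partial z_{l-1}} \frac{\partial C}{\partial z_l} = D_{l-1} W_l^\top \delta_l, \quad l = L, \ldots, 3, 2.$$

Using the $\{\delta_l\}$, we can now compute the derivatives with respect to the weight matrices
and the biases. In particular, applying the “scalar/matrix” differentiation rule (B.10) to
$z_l = W_l a_{l-1} + b_l$ gives:

$$\frac{\partial C}{\partial W_l} = \frac{\partial C}{\partial z_l} \frac{\partial z_l}{\partial W_l} = \delta_l a_{l-1}^\top, \quad l = 1, \ldots, L,$$

and

$$\frac{\partial C}{\partial b_l} = \frac{\partial z_l}{\partial b_l} \frac{\partial C}{\partial z_l} = \delta_l, \quad l = 1, \ldots, L. \qquad \Box$$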


From the theorem we can see that for each pair $(x, y)$ in the training set, we can compute the
gradient $\partial C/\partial\theta$ in a sequential manner, by computing $\delta_L, \ldots, \delta_1$. This procedure is called
back-propagation. Since back-propagation mostly involves simple matrix multiplication, it
can be efficiently implemented using dedicated computing hardware such as graphics pro-
cessing units (GPUs) and other parallel computing architectures. Note also that many matrix
computations that run in quadratic time can be replaced with linear-time componentwise
multiplication. Specifically, multiplication of a vector with a diagonal matrix is equivalent
to componentwise multiplication:

$$\underbrace{A}_{\mathrm{diag}(a)}\, b = a \odot b.$$

Consequently, we can write $\delta_{l-1} = D_{l-1} W_l^\top \delta_l$ as: $\delta_{l-1} = S'(z_{l-1}) \odot W_l^\top \delta_l$, $l = L, \ldots, 3, 2$.

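We now summarize the back-propagation algorithm for the computation of a typical
$\partial C/\partial\theta$. In the following algorithm, Lines 1 to 5 are the feed-forward part of the algorithm,
and Lines 7 to 10 are the back-propagation part of the algorithm.


Algorithm 9.3.1: Computing the Gradient of a Typical $C(\theta)$
input: Training example $(x, y)$, weight matrices and bias vectors $\{W_l, b_l\}_{l=1}^{L} =: \theta$,
       activation functions $\{S_l\}_{l=1}^{L}$.
output: The derivatives with respect to all weight matrices and bias vectors.
1 $a_0 \leftarrow x$
2 for $l = 1, \ldots, L$ do // feed-forward
3     $z_l \leftarrow W_l a_{l-1} + b_l$
4     $a_l \leftarrow S_l(z_l)$
5 $\delta_L \leftarrow \frac{\partial S_L}{\partial z_L} \frac{\partial C}{\partial g}$
6 $z_0 \leftarrow 0$ // arbitrary assignment needed to finish the loop
7 for $l = L, \ldots, 1$ do // back-propagation
8     $\frac{\partial C}{\partial b_l} \leftarrow \delta_l$
9     $\frac{\partial C}{\partial W_l} \leftarrow \delta_l a_{l-1}^\top$
10    $\delta_{l-1} \leftarrow S'(z_{l-1}) \odot W_l^\top \delta_l$
11 return $\frac{\partial C}{\partial W_l}$ and $\frac{\partial C}{\partial b_l}$ for all $l = 1, \ldots, L$ and the value $g(x) \leftarrow a_L$ (if needed)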
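A minimal sketch of Algorithm 9.3.1 follows, under several assumptions that are ours: a squared-error loss, componentwise activations that return both value and derivative, and a tiny random network. The last lines compare one back-propagated derivative with a finite-difference approximation as a sanity check.

```python
import numpy as np

def logistic(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s, s * (1.0 - s)                      # value and derivative

def identity(z):
    return z, np.ones_like(z)

def backprop(x, y, W, b, S):
    """Gradient of C = ||y - g(x)||^2; W, b, S are lists indexed 1..L."""
    L = len(W) - 1
    a, z, dS = [x], [None], [None]
    for l in range(1, L + 1):                    # feed-forward (Lines 1 to 4)
        z.append(W[l] @ a[l - 1] + b[l])
        val, der = S[l](z[l])
        a.append(val)
        dS.append(der)
    cost = np.sum((y - a[L]) ** 2)
    delta = dS[L] * 2 * (a[L] - y)               # Line 5 for squared-error loss
    dW, db = [None] * (L + 1), [None] * (L + 1)
    for l in range(L, 0, -1):                    # back-propagation (Lines 7 to 10)
        db[l] = delta
        dW[l] = delta @ a[l - 1].T
        if l > 1:                                # delta_0 is never used
            delta = dS[l - 1] * (W[l].T @ delta)
    return cost, dW, db

rng = np.random.default_rng(1)
p = [2, 3, 1]                                    # hypothetical layer sizes
W = [None] + [rng.standard_normal((p[l], p[l - 1])) for l in range(1, 3)]
b = [None] + [rng.standard_normal((p[l], 1)) for l in range(1, 3)]
S = [None, logistic, identity]
x, y = rng.standard_normal((2, 1)), rng.standard_normal((1, 1))

cost, dW, db = backprop(x, y, W, b, S)

eps = 1e-6                                       # finite-difference check of one weight
W[1][0, 1] += eps
fd = (backprop(x, y, W, b, S)[0] - cost) / eps
W[1][0, 1] -= eps
```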


Note that for the gradient of $C(\theta)$ to exist at every point, we need the activation func-
tions to be differentiable everywhere. This is the case, for example, for the logistic activa-
tion function in Figure 9.2. It is not the case for the ReLU function, which is differentiable
everywhere, except at $z = 0$. However, in practice, the kink of the ReLU function at $z = 0$
is unlikely to trip the back-propagation algorithm, because rounding errors and the finite-
precision computer arithmetic make it extremely unlikely that we will need to evaluate the
ReLU at precisely $z = 0$. This is the reason why in Theorem 9.2 we merely required that
$C(\theta)$ is almost-everywhere differentiable.

In spite of its kink at the origin, the ReLU has an important advantage over the logistic
function. While the derivative of the logistic function decays exponentially fast to zero as
we move away from the origin, a phenomenon referred to as saturation, the derivative of
the ReLU function is always unity for positive $z$. Thus, for large positive $z$, the derivative of
the logistic function does not carry any useful information, but the derivative of the ReLU
can help guide a gradient optimization algorithm. The situation for the Heaviside function
in Figure 9.2 is even worse, because its derivative is completely noninformative for any
$z \neq 0$. In this respect, the lack of saturation of the ReLU function for $z > 0$ makes it a
desirable activation function for training a network via back-propagation.

Finally, note that to obtain the gradient $\partial \ell_\tau/\partial\theta$ of the training loss, we simply need to
loop Algorithm 9.3.1 over all the $n$ training examples, as follows.


Algorithm 9.3.2: Computing the Gradient of the Training Loss
input: Training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$, weight matrices and bias vectors
       $\{W_l, b_l\}_{l=1}^{L} =: \theta$, activation functions $\{S_l\}_{l=1}^{L}$.
output: The gradient of the training loss.
1 for $i = 1, \ldots, n$ do // loop over all training examples
2     Run Algorithm 9.3.1 with input $(x_i, y_i)$ to compute $\left\{\frac{\partial C_i}{\partial W_l}, \frac{\partial C_i}{\partial b_l}\right\}_{l=1}^{L}$
3 return $\frac{\partial C}{\partial W_l} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial C_i}{\partial W_l}$ and $\frac{\partial C}{\partial b_l} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial C_i}{\partial b_l}$, for all $l = 1, \ldots, L$


Example 9.4 (Squared-Error and Cross-Entropy Loss) The back-propagation Al-
gorithm 9.3.1 requires a formula for $\delta_L$ in line 5. In particular, to execute line 5 we need to
specify both a loss function and an $S_L$ that defines the output layer: $g(x \mid \theta) = a_L = S_L(z_L)$.

For instance, in the multi-logit classification of inputs $x$ into $p_L$ categories labeled
$0, 1, \ldots, (p_L - 1)$, the output layer is defined via the softmax function:

$$S_L : z_L \mapsto \frac{\exp(z_L)}{\sum_{k=1}^{p_L} \exp(z_{L,k})}.$$

In other words, $g(x \mid \theta)$ is a probability vector such that its $(y+1)$-st component $g_{y+1}(x \mid \theta) =
g(y \mid \theta, x)$ is the estimate or prediction of the true conditional probability $f(y \mid x)$. Combin-
ing the softmax output with the cross-entropy loss, as was done in (7.17), yields:

$$\mathrm{Loss}(f(y \mid x), g(y \mid \theta, x)) = -\ln g(y \mid \theta, x) = -\ln g_{y+1}(x \mid \theta) = -z_{L,y+1} + \ln \sum_{k=1}^{p_L} \exp(z_{L,k}).$$

Hence, we obtain the vector $\delta_L$ with components ($k = 1, \ldots, p_L$)

$$\delta_{L,k} = \frac{\partial}{\partial z_{L,k}} \left( -z_{L,y+1} + \ln \sum_{k=1}^{p_L} \exp(z_{L,k}) \right) = g_k(x \mid \theta) - \mathbb{1}\{y = k - 1\}.$$

Note that we can remove a node from the final layer of the multi-logit network, be-
cause $g_1(x \mid \theta)$ (which corresponds to the $y = 0$ class) can be eliminated, using the fact
that $g_1(x \mid \theta) = 1 - \sum_{k=2}^{p_L} g_k(x \mid \theta)$. For a numerical comparison, see Exercise 13.
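The softmax and cross-entropy formula for $\delta_L$ can be checked numerically. In the sketch below the vector `z` and label `y` are arbitrary; the class label $y \in \{0, \ldots, p_L - 1\}$ doubles as a 0-based array index, so that `z[y]` plays the role of $z_{L,y+1}$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """-z_{y+1} + ln sum_k exp(z_k), written with a max-shift for stability."""
    return -z[y] + z.max() + np.log(np.sum(np.exp(z - z.max())))

z = np.array([0.5, -1.0, 2.0])     # illustrative z_L with p_L = 3
y = 2                              # illustrative class label
onehot = np.eye(len(z))[y]

delta_L = softmax(z) - onehot      # g_k(x | theta) - 1{y = k - 1}

eps = 1e-6                         # forward finite differences, one component at a time
fd = np.array([(cross_entropy(z + eps * np.eye(len(z))[k], y) - cross_entropy(z, y)) / eps
               for k in range(len(z))])
```

The components of `delta_L` sum to zero, because the softmax outputs sum to one.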

As another example, in nonlinear multi-output regression (see Example 9.1), the out-
put function $S_L$ is typically of the form (9.1), so that $\partial S_L/\partial z = \mathrm{diag}(S_L'(z_1), \ldots, S_L'(z_{p_L}))$.
Combining the output $g(x \mid \theta) = S_L(z_L)$ with the squared-error loss yields:

$$\mathrm{Loss}(y, g(x \mid \theta)) = \|y - g(x \mid \theta)\|^2 = \sum_{j=1}^{p_L} (y_j - g_j(x \mid \theta))^2.$$

Hence, line 5 in Algorithm 9.3.1 simplifies to:

$$\delta_L = \frac{\partial S_L}{\partial z} \frac{\partial C}{\partial g} = S_L'(z_L) \odot 2(g(x \mid \theta) - y).$$


9.4 Methods for Training


Neural networks have been studied for a long time, yet it is only recently that there have
been sufficient computational resources to train them effectively. The training of neural
networks requires minimization of a training loss, $\ell_\tau(g(\cdot \mid \theta)) = \frac{1}{n}\sum_{i=1}^{n} C_i(\theta)$, which is typ-
ically a difficult high-dimensional optimization problem with multiple local minima. We
next consider a number of simple training methods.


In this section, the vectors $\delta_t$ and $g_t$ use the notation of Section B.3.2 and should not
be confused with the derivative $\delta$ and the prediction function $g$, respectively.


9.4.1 Steepest Descent


If we can compute the gradient of $\ell_\tau(g(\cdot \mid \theta))$ via back-propagation, then we can apply the
steepest descent algorithm, which reads as follows. Starting from a guess $\theta_1$, we iterate the
following step until convergence:

$$\theta_{t+1} = \theta_t - \alpha_t u_t, \quad t = 1, 2, \ldots, \qquad (9.6)$$

where $u_t := \frac{\partial \ell_\tau}{\partial \theta}(\theta_t)$ and $\alpha_t$ is the learning rate.

Observe that, rather than operating directly on the weights and biases, we operate in-
stead on $\theta := \{W_l, b_l\}_{l=1}^{L}$, a column vector of length $\sum_{l=1}^{L} (p_{l-1} p_l + p_l)$ that stores all the
weight and bias parameters. The advantage of organizing the computations in this way is
that we can easily compute the learning rate $\alpha_t$; for example, via the Barzilai-Borwein
formula in (B.26).
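One way to organize the parameters into a single column vector $\theta$ is sketched below. The helper names `pack` and `unpack` and the layer-by-layer ordering (weights, then biases) are our own choices; any fixed ordering would serve equally well.

```python
import numpy as np

def pack(W, b):
    """Stack all weights and biases (lists indexed 1..L) into one column vector."""
    parts = [np.concatenate([W[l].ravel(), b[l].ravel()]) for l in range(1, len(W))]
    return np.concatenate(parts).reshape(-1, 1)

def unpack(theta, p):
    """Inverse of pack for layer sizes p = [p_0, ..., p_L]."""
    W, b, pos = [None], [None], 0
    for l in range(1, len(p)):
        n_w = p[l] * p[l - 1]
        W.append(theta[pos:pos + n_w].reshape(p[l], p[l - 1]))
        pos += n_w
        b.append(theta[pos:pos + p[l]].reshape(p[l], 1))
        pos += p[l]
    return W, b

p = [1, 20, 20, 1]                               # an illustrative architecture
rng = np.random.default_rng(0)
W = [None] + [rng.standard_normal((p[l], p[l - 1])) for l in range(1, len(p))]
b = [None] + [rng.standard_normal((p[l], 1)) for l in range(1, len(p))]

theta = pack(W, b)                               # length sum_l (p_{l-1} p_l + p_l)
W2, b2 = unpack(theta, p)
```

For this architecture the length is $1 \cdot 20 + 20 + 20 \cdot 20 + 20 + 20 \cdot 1 + 1 = 481$.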


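Algorithm 9.4.1: Training via Steepest Descent
input: Training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$, initial weight matrices and bias vectors
       $\{W_l, b_l\}_{l=1}^{L} =: \theta_1$, activation functions $\{S_l\}_{l=1}^{L}$.
output: The parameters of the trained learner.
1 $t \leftarrow 1$, $\delta \leftarrow 0.1 \times \mathbf{1}$, $u_{t-1} \leftarrow 0$, $\alpha \leftarrow 0.1$ // initialization
2 while stopping condition is not met do
3     compute the gradient $u_t = \frac{\partial \ell_\tau}{\partial \theta}(\theta_t)$ using Algorithm 9.3.2
4     $g \leftarrow u_t - u_{t-1}$
5     if $\delta^\top g > 0$ then // check if Hessian is positive-definite
6         $\alpha \leftarrow \delta^\top g / \|g\|^2$ // Barzilai-Borwein
7     else
8         $\alpha \leftarrow 2 \times \alpha$ // failing positivity, do something heuristic
9     $\delta \leftarrow -\alpha\, u_t$
10    $\theta_{t+1} \leftarrow \theta_t + \delta$
11    $t \leftarrow t + 1$
12 return $\theta_t$ as the minimizer of the training loss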
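The loop of Algorithm 9.4.1 can be tried on a toy problem before it is attached to a network. In this sketch the objective is an assumed two-dimensional quadratic with gradient $A\theta - c$ (not a neural network loss), and the stopping rule (a small gradient norm or an iteration cap) is our own choice.

```python
import numpy as np

def steepest_descent_bb(grad, theta, max_iter=500, tol=1e-10):
    """Steepest descent with the Barzilai-Borwein learning rate (sketch of Algorithm 9.4.1)."""
    delta = 0.1 * np.ones_like(theta)       # initialization as in Line 1
    u_prev = np.zeros_like(theta)
    alpha = 0.1
    for _ in range(max_iter):
        u = grad(theta)
        if np.linalg.norm(u) < tol:         # assumed stopping condition
            break
        g = u - u_prev
        if delta @ g > 0:
            alpha = (delta @ g) / (g @ g)   # Barzilai-Borwein step
        else:
            alpha = 2 * alpha               # heuristic fallback
        delta = -alpha * u
        theta = theta + delta
        u_prev = u
    return theta

A = np.diag([1.0, 10.0])                    # toy quadratic: 0.5 th^T A th - c^T th
c = np.array([1.0, 1.0])
grad = lambda th: A @ th - c

theta_hat = steepest_descent_bb(grad, np.zeros(2))   # minimizer is A^{-1} c = [1, 0.1]
```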


Typically, we initialize the algorithm with small random values for $\theta_1$, while being
careful to avoid saturating the activation function. For example, in the case of the ReLU
activation function, we will use small positive values to ensure that its derivative is not
zero. A zero derivative of the activation function prevents the propagation of information
useful for computing a good search direction.

Recall that computation of the gradient of the training loss via Algorithm 9.3.2 requires
averaging over all training examples. When the size $n$ of the training set $\tau_n$ is too large,
computation of the gradient $\partial \ell_\tau/\partial\theta$ via Algorithm 9.3.2 may be too costly. In such cases,
we may employ the stochastic gradient descent algorithm. In this algorithm, we view the
training loss as an expectation that can be approximated via Monte Carlo sampling. In
particular, if $K$ is a random variable with distribution $\mathbb{P}[K = k] = 1/n$ for $k = 1, \ldots, n$, then
we can write

$$\ell_\tau(g(\cdot \mid \theta)) = \frac{1}{n} \sum_{k=1}^{n} \mathrm{Loss}(y_k, g(x_k \mid \theta)) = \mathbb{E}\, \mathrm{Loss}(y_K, g(x_K \mid \theta)).$$

We can thus approximate $\ell_\tau(g(\cdot \mid \theta))$ via a Monte Carlo estimator using $N$ iid copies of $K$:

$$\widehat{\ell}_\tau(g(\cdot \mid \theta)) := \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}(y_{K_i}, g(x_{K_i} \mid \theta)).$$

The iid Monte Carlo sample $K_1, \ldots, K_N$ is called a minibatch (see also Exercise 3). Typic-
ally, $n \gg N$ so that the probability of observing ties in a minibatch of size $N$ is negligible.

Finally, note that if the learning rate of the stochastic gradient descent algorithm sat- 
isfies the conditions in (3.30), then the stochastic gradient descent algorithm is simply a 
version of the stochastic approximation Algorithm 3.4.5. 
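The minibatch estimator can be illustrated without any network at all. In the sketch below the per-example losses are simulated numbers standing in for $\mathrm{Loss}(y_k, g(x_k \mid \theta))$, and the sizes `n` and `N` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 10_000, 100                        # training-set and minibatch sizes (illustrative)
losses = rng.exponential(size=n)          # stand-in for the n per-example losses

full_loss = losses.mean()                 # the training loss: average over all n

K = rng.integers(0, n, size=N)            # iid uniform indices (0-based), the minibatch
minibatch_loss = losses[K].mean()         # Monte Carlo estimate from N examples

# Averaging many independent minibatch estimates recovers the full training loss,
# which illustrates that the estimator is unbiased.
estimates = [losses[rng.integers(0, n, size=N)].mean() for _ in range(2000)]
avg_estimate = np.mean(estimates)
```

The same index vector `K` would be used to average the per-example gradients from Algorithm 9.3.1, giving a stochastic gradient at a fraction of the cost of the full loop.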


9.4.2 Levenberg–Marquardt Method


Since a neural network with squared-error loss is a special type of nonlinear regression
model, it is possible to train it using classical nonlinear least-squares minimization meth-
ods, such as the Levenberg–Marquardt algorithm.

For simplicity of notation, suppose that the output of the net for an input $x$ is a scalar
$g(x)$. For a given input parameter $\theta$ of dimension $d = \dim(\theta)$, the Levenberg–Marquardt
Algorithm B.3.3 requires computation of the following vector of outputs:

$$g(\tau \mid \theta) := [g(x_1 \mid \theta), \ldots, g(x_n \mid \theta)]^\top,$$

as well as the $n \times d$ matrix of Jacobi, $G$, of $g$ at $\theta$. To compute these quantities, we can
again use the back-propagation Algorithm 9.3.1, as follows.


Algorithm 9.4.2: Output for Training via Levenberg–Marquardt
input: Training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$, parameter $\theta$.
output: Vector $g(\tau \mid \theta)$ and matrix of Jacobi $G$ for use in Algorithm B.3.3.
1 for $i = 1, \ldots, n$ do // loop over all training examples
2     Run Algorithm 9.3.1 with input $(x_i, y_i)$ (using $\frac{\partial C}{\partial g} = 1$ in line 5) to compute
      $g(x_i \mid \theta)$ and $\frac{\partial g(x_i \mid \theta)}{\partial \theta}$
3 $g(\tau \mid \theta) \leftarrow [g(x_1 \mid \theta), \ldots, g(x_n \mid \theta)]^\top$
4 $G \leftarrow \left[\frac{\partial g(x_1 \mid \theta)}{\partial \theta}, \ldots, \frac{\partial g(x_n \mid \theta)}{\partial \theta}\right]^\top$
5 return $g(\tau \mid \theta)$ and $G$

The Levenberg–Marquardt algorithm is not suitable for networks with a large number
of parameters, because the cost of the matrix computations becomes prohibitive. For in-
stance, obtaining the Levenberg–Marquardt search direction in (B.28) usually incurs an
$O(d^3)$ cost. In addition, the Levenberg–Marquardt algorithm is applicable only when we
wish to train the network using the squared-error loss. Both of these shortcomings are
mitigated to an extent with the quasi-Newton or adaptive gradient methods described next.


9.4.3 Limited-Memory BFGS Method 


All the methods discussed so far have been first-order optimization methods, that is, meth-
ods that only use the gradient vector $u_t := \frac{\partial \ell_\tau}{\partial \theta}(\theta_t)$ at the current (and/or immediate past) can-
didate solution $\theta_t$. In trying to design a more efficient second-order optimization method,
we may be tempted to use Newton's method with a search direction:

$$-H_t^{-1} u_t,$$

where $H_t$ is the $d \times d$ matrix of second-order partial derivatives of $\ell_\tau(g(\cdot \mid \theta))$ at $\theta_t$.

There are two problems with this approach. First, while the computation of $u_t$ via Al-
gorithm 9.3.2 typically costs $O(d)$, the computation of $H_t$ costs $O(d^2)$. Second, even if we
have somehow computed $H_t$ very fast, computing the search direction $H_t^{-1} u_t$ still incurs an
$O(d^3)$ cost. Both of these considerations make Newton's method impractical for large $d$.

Instead, a practical alternative is to use a quasi-Newton method, in which we directly
aim to approximate $H_t^{-1}$ via a matrix $C_t$ that satisfies the secant condition:

$$C_t\, g_{t-1} = \delta_{t-1},$$

where $\delta_t := \theta_{t+1} - \theta_t$ and $g_t := u_{t+1} - u_t$.

An ingenious formula that generates a suitable sequence of approximating matrices
$\{C_t\}$ (each satisfying the secant condition) is the BFGS updating formula (B.23), which
can be written as the recursion (see Exercise 9):

$$C_{t+1} = (I - \upsilon_t \delta_t g_t^\top)\, C_t\, (I - \upsilon_t g_t \delta_t^\top) + \upsilon_t \delta_t \delta_t^\top, \quad \upsilon_t := (g_t^\top \delta_t)^{-1}. \qquad (9.7)$$

This formula allows us to update $C_{t-1}$ to $C_t$ and then compute $C_t u_t$ in $O(d^2)$ time. While
this quasi-Newton approach is better than the $O(d^3)$ cost of Newton's method, it may be
still too costly in large-scale applications.

Instead, an approximate or limited memory BFGS updating can be achieved in $O(d)$
time. The idea is to store a few of the most recent pairs $\{\delta_t, g_t\}$ in order to evaluate the action
of $C_t$ on a vector $u_t$ without explicitly constructing and storing $C_t$ in computer memory. This is
possible, because (labeling the pairs so that $\{\delta_j, g_j\}$ takes $C_{j-1}$ to $C_j$) updating $C_0$ to $C_1$
in (9.7) requires only the pair $\{\delta_1, g_1\}$, and similarly computing $C_t$ from $C_0$ only requires
the history of the updates $\delta_1, g_1, \ldots, \delta_t, g_t$, which can be shown as follows.

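Define the matrices $A_t, \ldots, A_0$ via the backward recursion ($j = 1, \ldots, t$):

$$A_t := I, \qquad A_{j-1} := (I - \upsilon_j g_j \delta_j^\top)\, A_j,$$

and observe that all matrix vector products: $A_j u =: q_j$ for $j = 0, \ldots, t$ can be computed
efficiently via the backward recursion starting with $q_t = u$:

$$\tau_j := \delta_j^\top q_j, \qquad q_{j-1} = q_j - \upsilon_j \tau_j g_j, \quad j = t, t-1, \ldots, 1. \qquad (9.8)$$

In addition to $\{q_j\}$ we will make use of the vectors $\{r_j\}$ defined via the recursion:

$$r_0 := C_0 q_0, \qquad r_j = r_{j-1} + \upsilon_j (\tau_j - g_j^\top r_{j-1})\, \delta_j, \quad j = 1, \ldots, t. \qquad (9.9)$$

At each iteration $t$, the BFGS updating formula (9.7) can be rewritten in the form:

$$C_t = A_{t-1}^\top C_{t-1} A_{t-1} + \upsilon_t \delta_t \delta_t^\top.$$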
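By iterating this recursion backwards to $C_0$, we can write:

$$C_t = A_0^\top C_0 A_0 + \sum_{j=1}^{t} \upsilon_j A_j^\top \delta_j \delta_j^\top A_j,$$

that is, we can express $C_t$ in terms of the initial $C_0$ and the entire history of all BFGS values
$\{\delta_j, g_j\}$, as claimed. Further, with the $\{q_j, r_j\}$ computed via (9.8) and (9.9), we can write:

$$C_t u = A_0^\top C_0 q_0 + \sum_{j=1}^{t} \upsilon_j (\delta_j^\top q_j) A_j^\top \delta_j$$
$$= A_0^\top r_0 + \upsilon_1 \tau_1 A_1^\top \delta_1 + \sum_{j=2}^{t} \upsilon_j \tau_j A_j^\top \delta_j$$
$$= A_1^\top \left[ (I - \upsilon_1 \delta_1 g_1^\top)\, r_0 + \upsilon_1 \tau_1 \delta_1 \right] + \sum_{j=2}^{t} \upsilon_j \tau_j A_j^\top \delta_j.$$

Hence, from the definition of the $\{r_j\}$ in (9.9), we obtain

$$C_t u = A_1^\top r_1 + \sum_{j=2}^{t} \upsilon_j \tau_j A_j^\top \delta_j$$
$$= A_2^\top r_2 + \sum_{j=3}^{t} \upsilon_j \tau_j A_j^\top \delta_j$$
$$= \cdots = A_t^\top r_t + 0 = r_t.$$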

Given $C_0$ and the history of all recent BFGS values $\{\delta_j, g_j\}_{j=1}^{h}$, the computation of the quasi-
Newton search direction $d = -C_h u$ can be accomplished via the recursions (9.8) and (9.9)
as summarized in Algorithm 9.4.3.

Note that if $C_0$ is a diagonal matrix, say the identity matrix, then $C_0 q$ is cheap to
compute and the cost of running Algorithm 9.4.3 is $O(h\, d)$. Thus, for a fixed length of the
BFGS history, the cost of the limited-memory BFGS updating grows linearly in $d$, making
it a viable optimization algorithm in large-scale applications.

Algorithm 9.4.3: Limited-Memory BFGS Update
input: BFGS history list $\{\delta_i, g_i\}_{i=1}^{h}$, initial $C_0$, and input $u$.
output: $d = -C_h u$, where $C_t = (I - \upsilon_t \delta_t g_t^\top)\, C_{t-1}\, (I - \upsilon_t g_t \delta_t^\top) + \upsilon_t \delta_t \delta_t^\top$.
1 $q \leftarrow u$
2 for $i = h, h-1, \ldots, 1$ do // backward recursion to compute $A_0 u$
3     $\upsilon_i \leftarrow (\delta_i^\top g_i)^{-1}$
4     $\tau_i \leftarrow \delta_i^\top q$
5     $q \leftarrow q - \upsilon_i \tau_i g_i$
6 $q \leftarrow C_0\, q$ // compute $C_0 (A_0 u)$
7 for $i = 1, \ldots, h$ do // compute recursion (9.9)
8     $q \leftarrow q + \upsilon_i (\tau_i - g_i^\top q)\, \delta_i$
9 return $d \leftarrow -q$, the value of $-C_h u$
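Algorithm 9.4.3 can be sketched and checked against the explicit matrix recursion that it is meant to avoid. The test history below is synthetic: each $g_i$ is generated as $A\delta_i$ for a random positive-definite $A$ (our assumption), which guarantees $\delta_i^\top g_i > 0$. The function names and the scalar `C0_diag` (meaning $C_0$ is a multiple of the identity) are likewise our own.

```python
import numpy as np

def lbfgs_direction(history, u, C0_diag=1.0):
    """Sketch of Algorithm 9.4.3; history is a list of (delta_i, g_i), oldest first."""
    q = u.copy()
    ups, taus = [], []
    for delta, g in reversed(history):           # i = h, ..., 1 (Lines 2 to 5)
        ups_i = 1.0 / (delta @ g)
        tau_i = delta @ q
        q = q - ups_i * tau_i * g
        ups.append(ups_i)
        taus.append(tau_i)
    q = C0_diag * q                              # Line 6 with C_0 = C0_diag * I
    for (delta, g), ups_i, tau_i in zip(history, reversed(ups), reversed(taus)):
        q = q + ups_i * (tau_i - g @ q) * delta  # recursion (9.9), i = 1, ..., h
    return -q

def bfgs_matrix(history, d):
    """Explicit O(d^2) BFGS recursion starting from C_0 = I, for comparison only."""
    C, I = np.eye(d), np.eye(d)
    for delta, g in history:
        ups = 1.0 / (g @ delta)
        C = ((I - ups * np.outer(delta, g)) @ C @ (I - ups * np.outer(g, delta))
             + ups * np.outer(delta, delta))
    return C

rng = np.random.default_rng(0)
d, h = 5, 3
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)                      # positive-definite test matrix
history = []
for _ in range(h):
    delta = rng.standard_normal(d)
    history.append((delta, A @ delta))           # ensures delta^T g > 0
u = rng.standard_normal(d)

d_fast = lbfgs_direction(history, u)             # O(h d) work
d_slow = -bfgs_matrix(history, d) @ u            # O(h d^2) work
```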


In summary, a quasi-Newton algorithm with limited-memory BFGS updating reads as 
follows. 


Algorithm 9.4.4: Quasi-Newton Minimization with Limited-Memory BFGS
input: Training set $\tau = \{(x_i, y_i)\}_{i=1}^{n}$, initial weight matrices and bias vectors
       $\{W_l, b_l\}_{l=1}^{L} =: \theta_1$, activation functions $\{S_l\}_{l=1}^{L}$, and history parameter $h$.
output: The parameters of the trained learner.
1 $t \leftarrow 1$, $\delta \leftarrow 0.1 \times \mathbf{1}$, $u_{t-1} \leftarrow 0$ // initialization
2 while stopping condition is not met do
3     Compute $\ell_{\mathrm{value}} = \ell_\tau(g(\cdot \mid \theta_t))$ and $u_t = \frac{\partial \ell_\tau}{\partial \theta}(\theta_t)$ via Algorithm 9.3.2.
4     $g \leftarrow u_t - u_{t-1}$
5     Add $(\delta, g)$ to the BFGS history as the newest BFGS pair.
6     if the number of pairs in the BFGS history is greater than $h$ then
7         remove the oldest pair from the BFGS history
8     Compute $d$ via Algorithm 9.4.3 using the BFGS history, $C_0 = I$, and $u_t$.
9     $\alpha \leftarrow 1$
10    while $\ell_\tau(g(\cdot \mid \theta_t + \alpha\, d)) > \ell_{\mathrm{value}} + 10^{-4} \alpha\, d^\top u_t$ do
11        $\alpha \leftarrow \alpha / 1.5$ // line-search along quasi-Newton direction
12    $\delta \leftarrow \alpha\, d$
13    $\theta_{t+1} \leftarrow \theta_t + \delta$
14    $t \leftarrow t + 1$
15 return $\theta_t$ as the minimizer of the training loss


9.4.4 Adaptive Gradient Methods 


Recall that the limited-memory BFGS method in the previous section determines a search
direction using the recent history of previously computed gradients $\{u_t\}$ and input paramet-
ers $\{\theta_t\}$. This is because the BFGS pairs $\{\delta_t, g_t\}$ can be easily constructed from the identities:
$\delta_t = \theta_{t+1} - \theta_t$ and $g_t = u_{t+1} - u_t$. In other words, using only past gradient computations and
with little extra computation, it is possible to infer some of the second-order information
contained in the Hessian matrix of $\ell_\tau(\theta)$. In addition to the BFGS method, there are other
ways in which we can exploit the history of past gradient computations.

One approach is to use the normal approximation method, in which the Hessian of $\ell_\tau$
at $\theta_t$ is approximated via

$$\widehat{H}_t = \gamma I + \frac{1}{h} \sum_{i=t-h+1}^{t} u_i u_i^\top, \qquad (9.10)$$

where $u_{t-h+1}, \ldots, u_t$ are the $h$ most recently computed gradients and $\gamma$ is a tuning parameter
(for example, $\gamma = 1/h$). The search direction is then given by

$$-\widehat{H}_t^{-1} u_t,$$

which can be computed quickly in $O(h^2 d)$ time either using the QR decomposition (Exer-
cises 5 and 6), or the Sherman–Morrison Algorithm A.6.1. This approach requires that we
store the last $h$ gradient vectors in memory.

Another approach that completely bypasses the need to invert a Hessian approximation
is the Adaptive Gradient or AdaGrad method, in which we only store the diagonal of $\widehat{H}_t$
and use the search direction:

$$-\mathrm{diag}(\widehat{H}_t)^{-1/2} u_t.$$

We can avoid storing any of the gradient history by instead using the slightly different
search direction²

$$-u_t \big/ \sqrt{v_t + \gamma \times \mathbf{1}},$$

where the vector $v_t$ is updated recursively via

$$v_t = \left(1 - \frac{1}{h}\right) v_{t-1} + \frac{1}{h}\, u_t \odot u_t.$$

With this updating of $v_t$, the difference between the vector $v_t + \gamma \times \mathbf{1}$ and the diagonal of
the Hessian $\widehat{H}_t$ will be negligible.

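A more sophisticated version of AdaGrad is the adaptive moment estimation or Adam
method, in which we not only average the vectors $\{v_t\}$, but also average the gradient vectors
$\{u_t\}$, as follows.


Algorithm 9.4.5: Updating of Search Direction at Iteration $t$ via Adam
input: $u_t$, $\bar{u}_{t-1}$, $v_{t-1}$, $\theta_t$, and parameters $(\alpha, h_v, h_u)$, equal to, e.g., $(10^{-3}, 10^{3}, 10)$.
output: $\bar{u}_t$, $v_t$, $\theta_{t+1}$.
1 $\bar{u}_t \leftarrow \left(1 - \frac{1}{h_u}\right) \bar{u}_{t-1} + \frac{1}{h_u}\, u_t$
2 $v_t \leftarrow \left(1 - \frac{1}{h_v}\right) v_{t-1} + \frac{1}{h_v}\, u_t \odot u_t$
3 $\bar{u}_t^* \leftarrow \bar{u}_t \big/ \left(1 - (1 - h_u^{-1})^t\right)$
4 $v_t^* \leftarrow v_t \big/ \left(1 - (1 - h_v^{-1})^t\right)$
5 $\theta_{t+1} \leftarrow \theta_t - \alpha\, \bar{u}_t^* \big/ \left(\sqrt{v_t^*} + 10^{-8} \times \mathbf{1}\right)$
6 return $\bar{u}_t$, $v_t$, $\theta_{t+1}$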
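One Adam update can be sketched as a small function. The objective used to exercise it, $\frac{1}{2}\|\theta - c\|^2$ with gradient $\theta - c$, and the larger step size `alpha=0.01` in the loop are illustrative assumptions, not settings taken from the text.

```python
import numpy as np

def adam_step(u, u_bar, v, theta, t, alpha=1e-3, h_v=1e3, h_u=10.0):
    """One update of Algorithm 9.4.5 (sketch); all vector operations are componentwise."""
    u_bar = (1 - 1 / h_u) * u_bar + u / h_u            # averaged gradient
    v = (1 - 1 / h_v) * v + (u * u) / h_v              # averaged squared gradient
    u_star = u_bar / (1 - (1 - 1 / h_u) ** t)          # bias corrections
    v_star = v / (1 - (1 - 1 / h_v) ** t)
    theta = theta - alpha * u_star / (np.sqrt(v_star) + 1e-8)
    return u_bar, v, theta

c = np.array([1.0, -2.0])                              # minimizer of the toy objective
theta0 = np.zeros(2)
zeros = np.zeros(2)

# First step: with zero initial averages, each component moves by about alpha.
_, _, theta1 = adam_step(theta0 - c, zeros, zeros, theta0, t=1)

# A longer run with a larger (assumed) step size.
theta, u_bar, v = theta0, zeros, zeros
for t in range(1, 2001):
    u_bar, v, theta = adam_step(theta - c, u_bar, v, theta, t, alpha=0.01)
```

The first-step behavior shows why the bias corrections in Lines 3 and 4 matter: without them the initial averages, started at zero, would shrink the early steps.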







² Here we divide two vectors componentwise.


Yet another computationally cheap approach is the momentum method, in which the
steepest descent iteration (9.6) is modified to

$$\theta_{t+1} = \theta_t - \alpha_t u_t + \gamma\, \delta_{t-1},$$

where $\delta_{t-1} = \theta_t - \theta_{t-1}$ and $\gamma$ is a tuning parameter. This strategy frequently performs better
than the “vanilla” steepest descent method, because the search direction is less likely to
change abruptly.

Numerical experience suggests that the vanilla steepest-descent Algorithm 9.4.1 and
the Levenberg–Marquardt Algorithm B.3.3 are effective for networks with shallow archi-
tectures, but not for networks with deep architectures. In comparison, the stochastic gradi-
ent descent method, the limited-memory BFGS Algorithm 9.4.4, or any of the adaptive
gradient methods in this section, can frequently handle networks with many hidden lay-
ers (provided that any tuning parameters and initialization values are carefully chosen via
experimentation).


9.5 Examples in Python 


In this section we provide two numerical examples in Python. In the first example, we 
train a neural network with the stochastic gradient descent method using the polynomial 
regression data from Example 2.1, and without using any specialized Python packages. 

In the second example, we consider a realistic application of a neural network to image 
recognition and classification. Here we use the specialized open-source Python package 
Pytorch. 


9.5.1 Simple Polynomial Regression 


Consider again the polynomial regression data set depicted in Figure 2.4. We use a network with architecture

$$[p_0, p_1, p_2, p_3] = [1, 20, 20, 1].$$

In other words, we have two hidden layers with 20 neurons each, resulting in a learner with a total of $\dim(\boldsymbol{\theta}) = 481$ parameters. To implement such a neural network, we first import the numpy and the matplotlib packages, then read the regression problem data and define the feed-forward neural network layers.


NeuralNetPurePython.py

import numpy as np
import matplotlib.pyplot as plt

#%%
# import data
data = np.genfromtxt('polyreg.csv',delimiter=',')
X = data[:,0].reshape(-1,1)
y = data[:,1].reshape(-1,1)

# Network setup
p = [X.shape[1],20,20,1] # size of layers
L = len(p)-1 # number of layers

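
The parameter count quoted for this architecture can be verified with a short calculation: layer l contributes a $p_l \times p_{l-1}$ weight matrix and a bias vector of length $p_l$. The helper name num_parameters below is hypothetical.

```python
def num_parameters(p):
    """Total number of weights and biases for layer sizes p[0], ..., p[L]."""
    return sum(p[l] * p[l - 1] + p[l] for l in range(1, len(p)))

print(num_parameters([1, 20, 20, 1]))  # 481 = (20 + 20) + (400 + 20) + (20 + 1)
```
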
Next, the initialize method generates random initial weight matrices and bias vectors $\{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$. Specifically, all parameters are initialized with values distributed according to the standard normal distribution.


def initialize(p, w_sig = 1):
    W, b = [[]]*len(p), [[]]*len(p)
    for l in range(1,len(p)):
        W[l]= w_sig * np.random.randn(p[l], p[l-1])
        b[l]= w_sig * np.random.randn(p[l], 1)
    return W,b

W,b = initialize(p) # initialize weight matrices and bias vectors





The following code implements the ReLU activation function from Figure 9.2 and the squared error loss. Note that these functions return both the function values and the corresponding gradients. The output layer (l = L) uses the identity activation.

def RELU(z,l): # RELU activation function: value and derivative
    if l == L: return z, np.ones_like(z)
    else:
        val = np.maximum(0,z) # RELU function element-wise
        J = np.array(z>0, dtype = float) # derivative of RELU element-wise
        return val, J

def loss_fn(y,g):
    return (g - y)**2, 2 * (g - y)

S = RELU





Next, we implement the feed-forward and backward-propagation Algorithm 9.3.1. 
Here, we have implemented Algorithm 9.3.2 inside the backward-propagation loop. 





def feedforward(x,W,b):
    a, z, gr_S = [0]*(L+1), [0]*(L+1), [0]*(L+1)

    a[0] = x.reshape(-1,1)
    for l in range(1,L+1):
        z[l] = W[l] @ a[l-1] + b[l] # affine transformation
        a[l], gr_S[l] = S(z[l],l) # activation function
    return a, z, gr_S

def backward(W,b,X,y):
    n = len(y)
    delta = [0]*(L+1)
    dC_db, dC_dW = [0]*(L+1), [0]*(L+1)
    loss = 0

    for i in range(n): # loop over training examples
        a, z, gr_S = feedforward(X[i,:].T, W, b)
        cost, gr_C = loss_fn(y[i], a[L]) # cost i and gradient wrt g
        loss += cost/n
        delta[L] = gr_S[L] @ gr_C

        for l in range(L,0,-1): # l = L,...,1
            dCi_dbl = delta[l]
            dCi_dWl = delta[l] @ a[l-1].T

            # ---- sum up over samples ----
            dC_db[l] = dC_db[l] + dCi_dbl/n
            dC_dW[l] = dC_dW[l] + dCi_dWl/n

            delta[l-1] = gr_S[l-1] * W[l].T @ delta[l]

    return dC_dW, dC_db, loss





As explained in Section 9.4, it is sometimes more convenient to collect all the weight matrices and bias vectors $\{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$ into a single vector $\boldsymbol{\theta}$. Consequently, we code two functions that map the weight matrices and the bias vectors into a single parameter vector, and vice versa.


def list2vec(W,b):
    # converts list of weight matrices and bias vectors into
    # one column vector
    b_stack = np.vstack([b[i] for i in range(1,len(b))])
    W_stack = np.vstack([W[i].flatten().reshape(-1,1) for i in range(1,len(W))])
    vec = np.vstack([b_stack, W_stack])
    return vec

def vec2list(vec, p):
    # converts vector to weight matrices and bias vectors
    W, b = [[]]*len(p), [[]]*len(p)
    p_count = 0

    for l in range(1,len(p)): # construct bias vectors
        b[l] = vec[p_count:(p_count+p[l])].reshape(-1,1)
        p_count = p_count + p[l]

    for l in range(1,len(p)): # construct weight matrices
        W[l] = vec[p_count:(p_count + p[l]*p[l-1])].reshape(p[l], p[l-1])
        p_count = p_count + (p[l]*p[l-1])

    return W, b





Finally, we run the stochastic gradient descent for $10^4$ iterations using a minibatch of size 20 and a constant learning rate of $\alpha_t = 0.005$.





batch_size = 20
lr = 0.005
beta = list2vec(W,b)
loss_arr = []
n = len(X)
num_epochs = 10000
print("epoch | batch loss")
print("----------------------------")
for epoch in range(1,num_epochs+1):
    batch_idx = np.random.choice(n,batch_size)
    batch_X = X[batch_idx].reshape(-1,1)
    batch_y = y[batch_idx].reshape(-1,1)
    dC_dW, dC_db, loss = backward(W,b,batch_X,batch_y)
    d_beta = list2vec(dC_dW,dC_db)
    loss_arr.append(loss.flatten()[0])
    if(epoch==1 or np.mod(epoch,1000)==0):
        print(epoch,": ",loss.flatten()[0])
    beta = beta - lr*d_beta
    W,b = vec2list(beta,p)

# calculate the loss of the entire training set
dC_dW, dC_db, loss = backward(W,b,X,y)
print("entire training set loss = ",loss.flatten()[0])
xx = np.arange(0,1,0.01)
y_preds = np.zeros_like(xx)
for i in range(len(xx)):
    a, _, _ = feedforward(xx[i],W,b)
    y_preds[i] = a[L].item() # scalar network output
plt.plot(X,y, 'r.', markersize = 4,label = 'y')
plt.plot(np.array(xx), y_preds, 'b',label = 'fit')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
plt.plot(np.array(loss_arr), 'b')
plt.xlabel('iteration')
plt.ylabel('Training Loss')
plt.show()


epoch | batch loss
----------------------------
1 : 158.6779278688539
1000 : 54.52430507401445
2000 : 38.346572088604965 
3000 : 31.02036319180713 
4000 : 22.91114276931535 
5000 : 27.75810262906341 
6000 : 22.296907007032928 
7000 : 17.337367420038046 
8000 : 19.233689945334195 
9000 : 39.54261478969857 
10000 : 14.754724387604416 
entire training set loss = 28.904957963612727 





The left panel of Figure 9.5 shows a trained neural network with a training loss of approximately 28.9. As seen from the right panel of Figure 9.5, the algorithm initially makes rapid progress until it settles down into a stationary regime after 400 iterations.

Figure 9.5: Left panel: The fitted neural network with training loss of $\ell_\tau(g_\tau) \approx 28.9$. Right panel: The evolution of the estimated loss, $\hat{\ell}_\tau(g_\tau(\cdot \mid \boldsymbol{\theta}))$, over the steepest-descent iterations.


9.5.2 Image Classification 


In this section, we will use the package Pytorch, which is an open-source machine learning library for Python. Pytorch can easily exploit any graphics processing unit (GPU) for accelerated computation. As an example, we consider the Fashion-MNIST data set from https://www.kaggle.com/zalando-research/fashionmnist. The Fashion-MNIST data set contains 28 × 28 gray-scale images of clothing. Our task is to classify each image according to its label. Specifically, the labels are: T-Shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle Boot. Figure 9.6 depicts a typical ankle boot in the left panel and a typical dress in the right panel. To start with, we import the required libraries and load the Fashion-MNIST data set.





Figure 9.6: Left: an ankle boot. Right: a dress. 





ImageClassificationPytorch.py

import torch
import torch.nn as nn
from torch.autograd import Variable
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import torch.nn.functional as F
#################################################################
# data loader class
#################################################################
class LoadData(Dataset):
    def __init__(self, fName, transform=None):
        data = pd.read_csv(fName)
        self.X = np.array(data.iloc[:, 1:], dtype=np.uint8).reshape(-1, 1, 28, 28)
        self.y = np.array(data.iloc[:, 0])

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        img = self.X[idx]
        lbl = self.y[idx]
        return (img, lbl)

# load the image data
train_ds = LoadData('fashionmnist/fashion-mnist_train.csv')
test_ds = LoadData('fashionmnist/fashion-mnist_test.csv')

# set labels dictionary
labels = {0 : 'T-Shirt', 1 : 'Trouser', 2 : 'Pullover',
          3 : 'Dress', 4 : 'Coat', 5 : 'Sandal', 6 : 'Shirt',
          7 : 'Sneaker', 8 : 'Bag', 9 : 'Ankle Boot'}





Since image input data are generally memory intensive, it is important to partition the data set into (mini-)batches. The code below defines a batch size of 100 images and initializes the Pytorch data loader objects. These objects will be used for efficient iteration over the data set.


# load the data in batches
batch_size = 100

train_loader = torch.utils.data.DataLoader(dataset=train_ds,
                                           batch_size=batch_size,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_ds,
                                          batch_size=batch_size,
                                          shuffle=True)





Next, to define the network architecture in Pytorch all we need to do is define a class that inherits from the torch.nn.Module class. Choosing a network architecture with good generalization properties can be a difficult task. Here, we use a network with two convolution layers (defined in the cnn_layer block), a 3 × 3 kernel, and three hidden layers (defined in the flat_layer block). Since there are ten possible output labels, the output layer has ten nodes. More specifically, the first and the second convolution layers have 16 and 32 output channels. Combining this with the definition of the 3 × 3 kernel, we conclude that the size of the first flat hidden layer should be:

$$\Big(\overbrace{\underbrace{(28 - 3 + 1)}_{\text{first convolution layer}} - 3 + 1}^{\text{second convolution layer}}\Big)^2 \times 32 = 18432,$$

where the multiplication by 32 follows from the fact that the second convolution layer has 32 output channels. The flat_fts variable stores this number of output features of the convolution block. This number is used to define the input size of the first hidden layer of the flat_layer block. The rest of the hidden layers have 100 neurons and we use the ReLU activation function for all layers. Finally, note that the forward method in the CNN class implements the forward pass.
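
The size 18432 can be double-checked with the usual output-size rule for a convolution without dilation. The sketch below is an illustration only; the helper name conv_out_size is hypothetical, and it assumes stride 1 and no padding, as in the network described above.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28
for _ in range(2):              # two 3 x 3 convolutions, stride 1, no padding
    side = conv_out_size(side, kernel=3)
flat_fts = side ** 2 * 32       # 32 output channels in the second convolution layer
print(side, flat_fts)           # 24 18432
```
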


# define the network
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()

        self.cnn_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=(1,1)),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=(1,1)),
            nn.ReLU(),
        )
        self.flat_fts = (((28-3+1)-3+1)**2)*32

        self.flat_layer = nn.Sequential(
            nn.Linear(self.flat_fts, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 10))

    def forward(self, x):
        out = self.cnn_layer(x)
        out = out.view(-1, self.flat_fts)
        out = self.flat_layer(out)
        return out


Next, we specify how the network will be trained. We choose the device type, namely, the central processing unit (CPU) or the GPU (if available), the number of training iterations (epochs), and the learning rate. Then, we create an instance of the proposed convolution network and send it to the predefined device (CPU or GPU). Note how easily one can switch between the CPU and the GPU without major changes to the code.

In addition to the specifications above, we need to choose an appropriate loss function and training algorithm. Here, we use the cross-entropy loss and the Adam adaptive gradient Algorithm 9.4.5. Once these parameters are set, the learning proceeds to evaluate the gradient of the loss function via the back-propagation algorithm.

# learning parameters
num_epochs = 50
learning_rate = 0.001

#device = torch.device('cpu') # use this to run on CPU
device = torch.device('cuda') # use this to run on GPU

# instance of the Conv Net
cnn = CNN()
cnn.to(device=device)

# loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=learning_rate)

# the learning loop
losses = []
for epoch in range(1,num_epochs+1):
    for i, (images, labels) in enumerate(train_loader):
        images = Variable(images.float()).to(device=device)
        labels = Variable(labels).to(device=device)

        optimizer.zero_grad()
        outputs = cnn(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
    if(epoch==1 or epoch % 10 == 0):
        print("Epoch : ", epoch, ", Training Loss: ", loss.item())

# evaluate on the test set
cnn.eval()
correct = 0
total = 0
for images, labels in test_loader:
    images = Variable(images.float()).to(device=device)
    outputs = cnn(images)
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted.cpu() == labels).sum()
print("Test Accuracy of the model on the 10,000 training test images: ",
      (100 * correct.item() / total),"%")

# plot
plt.rc('text', usetex=True)
plt.rc('font', family='serif',size=20)
plt.tight_layout()
plt.plot(np.array(losses)[10:len(losses)])
plt.xlabel(r'{iteration}',fontsize=20)
plt.ylabel(r'{Batch Loss}', fontsize=20)
plt.subplots_adjust(top=0.8)
plt.show()





Epoch : 1 , Training Loss: 0.412550151348114
Epoch : 10 , Training Loss: 0.05452106520533562
Epoch : 20 , Training Loss: 0.07233225554227829
Epoch : 30 , Training Loss: 0.01696968264877796
Epoch : 40 , Training Loss: 0.0008199119474738836
Epoch : 50 , Training Loss: 0.006860652007162571
Test Accuracy of the model on the 10,000 training test images: 91.02 %




Finally, we evaluate the network performance using the test data set. A typical minibatch loss as a function of iteration is shown in Figure 9.7 and the proposed neural network achieves about 91% accuracy on the test set.

Figure 9.7: The batch loss history.


Further Reading 


A popular book written by some of the pioneers of deep learning is [53]. For an excellent and gentle introduction to the intuition behind neural networks, we recommend [94]. A summary of many effective gradient descent methods for training of deep networks is given in [105]. An early resource on the limited-memory BFGS method is [81], and a more recent resource includes [13], which makes recommendations on the best choice for the length of the BFGS history (that is, the value of the parameter h).


Exercises 


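1. Show that the softmax function

$$\mathrm{softmax} : \mathbf{z} \mapsto \frac{\exp(\mathbf{z})}{\sum_k \exp(z_k)}$$

satisfies the invariance property:

$$\mathrm{softmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} + c \times \mathbf{1}), \quad \text{for any constant } c.$$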
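
The invariance property is not proved by a numerical experiment, but a quick check shows why it matters in practice: subtracting max(z) before exponentiating avoids overflow without changing the result. The function names below are hypothetical and the test vector is arbitrary.

```python
import numpy as np

def softmax_naive(z):
    """Direct evaluation of exp(z) / sum_k exp(z_k)."""
    w = np.exp(z)
    return w / w.sum()

def softmax_stable(z):
    """Same function, evaluated at the shifted input z - max(z) x 1."""
    w = np.exp(z - np.max(z))
    return w / w.sum()

z = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax_naive(z), softmax_naive(z + 100.0)))  # True
print(np.allclose(softmax_naive(z), softmax_stable(z)))         # True
```
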
2. Projection pursuit is a network with one hidden layer that can be written as:

$$g(\mathbf{x}) = S(\boldsymbol{\omega}^\top \mathbf{x}),$$

where $S$ is a univariate smoothing cubic spline. If we use squared-error loss with $\tau_n = \{y_i, \mathbf{x}_i\}_{i=1}^n$, we need to minimize the training loss:

$$\frac{1}{n} \sum_{i=1}^n \left(y_i - S(\boldsymbol{\omega}^\top \mathbf{x}_i)\right)^2$$

with respect to $\boldsymbol{\omega}$ and all cubic smoothing splines. This training of the network is typically tackled iteratively in a manner similar to the EM algorithm. In particular, we iterate (t = 1, 2, ...) the following steps until convergence.

(a) Given the missing data $\boldsymbol{\omega}_t$, compute the spline $S_t$ by training a cubic smoothing spline on $\{y_i, \boldsymbol{\omega}_t^\top \mathbf{x}_i\}$. The smoothing coefficient of the spline may be determined as part of this step.

(b) Given the spline function $S_t$, compute the next projection vector $\boldsymbol{\omega}_{t+1}$ via iterative reweighted least squares:

$$\boldsymbol{\omega}_{t+1} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}}\; (\mathbf{e}_t - \mathbf{X}\boldsymbol{\beta})^\top \boldsymbol{\Sigma}_t\, (\mathbf{e}_t - \mathbf{X}\boldsymbol{\beta}), \tag{9.11}$$

where

$$e_{t,i} := \boldsymbol{\omega}_t^\top \mathbf{x}_i + \frac{y_i - S_t(\boldsymbol{\omega}_t^\top \mathbf{x}_i)}{S_t'(\boldsymbol{\omega}_t^\top \mathbf{x}_i)}, \quad i = 1, \ldots, n$$

is the adjusted response, and $\boldsymbol{\Sigma}_t^{1/2} = \mathrm{diag}\left(S_t'(\boldsymbol{\omega}_t^\top \mathbf{x}_1), \ldots, S_t'(\boldsymbol{\omega}_t^\top \mathbf{x}_n)\right)$ is a diagonal matrix.

Apply Taylor’s Theorem B.1 to the function $S_t$ and derive the iterative reweighted least-squares optimization program (9.11).


3. Suppose that in the stochastic gradient descent method we wish to repeatedly draw minibatches of size N from $\tau_n$, where we assume that N × m = n for some large integer m. Instead of repeatedly resampling from $\tau_n$, an alternative is to reshuffle $\tau_n$ via a random permutation $\Pi$ and then advance sequentially through the reshuffled training set to construct m non-overlapping minibatches. A single traversal of such a reshuffled training set is called an epoch. The following pseudo-code describes the procedure.




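Algorithm 9.5.1: Stochastic Gradient Descent with Reshuffling
input: Training set $\tau_n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, initial weight matrices and bias vectors $\{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^L \to \boldsymbol{\theta}_1$, activation functions $\{\mathbf{S}_l\}_{l=1}^L$, learning rates $\{\alpha_1, \alpha_2, \ldots\}$.
output: The parameters of the trained learner.
1 $t \leftarrow 1$ and epoch $\leftarrow 0$
2 while stopping condition is not met do
3     Draw $U_1, \ldots, U_n \overset{\text{iid}}{\sim} \mathsf{U}(0, 1)$.
4     Let $\Pi$ be the permutation of $\{1, \ldots, n\}$ that satisfies $U_{\Pi_1} < \cdots < U_{\Pi_n}$.
5     $(\mathbf{x}_i, y_i) \leftarrow (\mathbf{x}_{\Pi_i}, y_{\Pi_i})$ for $i = 1, \ldots, n$ // reshuffle $\tau_n$
6     for $j = 1, \ldots, m$ do
7         $\hat{\ell}_\tau \leftarrow \frac{1}{N} \sum_{i=(j-1)N+1}^{jN} \mathrm{Loss}(y_i, g(\mathbf{x}_i \mid \boldsymbol{\theta}))$
8         $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha_t \frac{\partial \hat{\ell}_\tau}{\partial \boldsymbol{\theta}}(\boldsymbol{\theta}_t)$
9         $t \leftarrow t + 1$
10    epoch $\leftarrow$ epoch + 1 // number of reshuffles or epochs
11 return $\boldsymbol{\theta}_t$ as the minimizer of the training loss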


Write Python code that implements the stochastic gradient descent with data reshuffling, 
and use it to train the neural net in Section 9.5.1. 
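
As a starting point only (it is not a full solution), the reshuffling step of the algorithm can be sketched as follows: sorting n independent uniforms yields a uniformly random permutation, which is then cut into m = n/N consecutive minibatches. The generator name epoch_minibatches and the sizes used are assumptions for illustration.

```python
import numpy as np

def epoch_minibatches(n, N, rng):
    """Yield m = n // N disjoint index batches obtained from one random permutation."""
    U = rng.uniform(size=n)        # U_1, ..., U_n iid U(0, 1)
    perm = np.argsort(U)           # indices ordered so that U[perm[0]] < ... < U[perm[-1]]
    for j in range(n // N):
        yield perm[j * N:(j + 1) * N]

rng = np.random.default_rng(0)
batches = list(epoch_minibatches(n=100, N=20, rng=rng))   # one epoch: 5 batches of size 20
```

Each call to epoch_minibatches corresponds to one epoch; the gradient step for each batch would go inside the loop over the yielded index sets.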


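4. Denote the pdf of the $\mathsf{N}(\mathbf{0}, \boldsymbol{\Sigma})$ distribution by $\varphi_{\boldsymbol{\Sigma}}(\cdot)$, and let

$$\mathcal{D}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0 \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \int_{\mathbb{R}^d} \varphi_{\boldsymbol{\Sigma}_0}(\mathbf{x} - \boldsymbol{\mu}_0) \ln \frac{\varphi_{\boldsymbol{\Sigma}_0}(\mathbf{x} - \boldsymbol{\mu}_0)}{\varphi_{\boldsymbol{\Sigma}_1}(\mathbf{x} - \boldsymbol{\mu}_1)}\, \mathrm{d}\mathbf{x}$$

be the Kullback–Leibler divergence between the densities of the $\mathsf{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ and $\mathsf{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ distributions on $\mathbb{R}^d$. Show that

$$2\mathcal{D}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0 \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \mathrm{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0) - \ln|\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0| + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - d.$$

Hence, deduce the formula in (B.22).
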
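
Once derived, the closed-form expression in Exercise 4 is easy to evaluate numerically. The sketch below codes the right-hand side of the identity (divided by 2); the function name kl_gauss is hypothetical, and the one-dimensional test values are chosen only because the answer is easy to compute by hand.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Kullback-Leibler divergence of N(mu0, S0) from N(mu1, S1) via the closed form."""
    d = len(mu0)
    M = np.linalg.solve(S1, S0)                  # Sigma_1^{-1} Sigma_0
    diff = mu1 - mu0
    quad = diff @ np.linalg.solve(S1, diff)      # (mu1 - mu0)' Sigma_1^{-1} (mu1 - mu0)
    _, logdet = np.linalg.slogdet(M)             # ln |Sigma_1^{-1} Sigma_0|
    return 0.5 * (np.trace(M) - logdet + quad - d)

# N(0, 1) versus N(1, 1): only the quadratic term contributes, giving 1/2.
print(kl_gauss(np.array([0.0]), np.eye(1), np.array([1.0]), np.eye(1)))  # 0.5
```
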
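5. Suppose that we wish to compute the inverse and log-determinant of the matrix

$$\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top,$$

where $\mathbf{U}$ is an n × h matrix with $h \ll n$. Show that

$$(\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top)^{-1} = \mathbf{I}_n - \mathbf{Q}_n\mathbf{Q}_n^\top,$$

where $\mathbf{Q}_n$ contains the first n rows of the (n + h) × h matrix $\mathbf{Q}$ in the QR factorization of the (n + h) × h matrix:

$$\begin{bmatrix} \mathbf{U} \\ \mathbf{I}_h \end{bmatrix} = \mathbf{Q}\mathbf{R}.$$

In addition, show that $\ln|\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top| = \sum_{i=1}^h \ln r_{ii}^2$, where $\{r_{ii}\}$ are the diagonal elements of the h × h matrix $\mathbf{R}$.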
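
Both identities in Exercise 5 can be checked numerically before attempting the proof. The sketch below uses a random matrix U with assumed sizes n = 50 and h = 3, and relies on np.linalg.qr returning the reduced factorization by default.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 50, 3
U = rng.standard_normal((n, h))

# Reduced QR of the stacked (n + h) x h matrix [U; I_h]: Q is (n + h) x h, R is h x h.
Q, R = np.linalg.qr(np.vstack([U, np.eye(h)]))
Qn = Q[:n, :]                                    # first n rows of Q

inv_qr = np.eye(n) - Qn @ Qn.T                   # claimed inverse of I_n + U U'
logdet_qr = np.sum(np.log(np.diag(R) ** 2))      # claimed log-determinant

inv_direct = np.linalg.inv(np.eye(n) + U @ U.T)
_, logdet_direct = np.linalg.slogdet(np.eye(n) + U @ U.T)
print(np.allclose(inv_qr, inv_direct), np.isclose(logdet_qr, logdet_direct))  # True True
```
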
6. Suppose that

$$\mathbf{U} = [\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_{h-1}],$$

where all $\mathbf{u} \in \mathbb{R}^n$ are column vectors and we have computed $(\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top)^{-1}$ via the QR factorization method in Exercise 5. If the columns of matrix $\mathbf{U}$ are updated to

$$[\mathbf{u}_1, \ldots, \mathbf{u}_{h-1}, \mathbf{u}_h],$$

show that the inverse $(\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top)^{-1}$ can be updated in $\mathcal{O}(hn)$ time (rather than computed from scratch in $\mathcal{O}(h^2 n)$ time). Deduce that the computing cost of updating the Hessian approximation (9.10) is the same as that for the limited-memory BFGS Algorithm 9.4.3.

In your solution you may use the following facts from [29]. Suppose we are given the $\mathbf{Q}$ and $\mathbf{R}$ factors in the QR factorization of a matrix $\mathbf{A} \in \mathbb{R}^{n \times h}$. If a row/column is added to matrix $\mathbf{A}$, then the $\mathbf{Q}$ and $\mathbf{R}$ factors need not be recomputed from scratch (in $\mathcal{O}(h^2 n)$ time), but can be updated efficiently in $\mathcal{O}(hn)$ time. Similarly, if a row/column is removed from matrix $\mathbf{A}$, then the $\mathbf{Q}$ and $\mathbf{R}$ factors can be updated in $\mathcal{O}(h^2)$ time.


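7. Suppose that $\mathbf{U} \in \mathbb{R}^{n \times h}$ has its k-th column $\mathbf{v}$ replaced with $\mathbf{w}$, giving the updated $\widetilde{\mathbf{U}}$.

(a) If $\mathbf{e} \in \mathbb{R}^h$ denotes the unit-length vector such that $e_k = \|\mathbf{e}\| = 1$ and

$$\mathbf{r}_\pm := \frac{\sqrt{2}}{2}\, \mathbf{U}^\top(\mathbf{w} - \mathbf{v}) + \frac{\sqrt{2}\, \|\mathbf{w} - \mathbf{v}\|^2}{4}\, \mathbf{e} \pm \frac{\sqrt{2}}{2}\, \mathbf{e},$$

show that

$$\widetilde{\mathbf{U}}^\top \widetilde{\mathbf{U}} = \mathbf{U}^\top\mathbf{U} + \mathbf{r}_+\mathbf{r}_+^\top - \mathbf{r}_-\mathbf{r}_-^\top.$$

[Hint: You may find Exercise 16 in Chapter 6 useful.]

(b) Let $\mathbf{B} := (\mathbf{I}_h + \mathbf{U}^\top\mathbf{U})^{-1}$. Use the Woodbury identity (A.15) to show that

$$(\mathbf{I}_n + \widetilde{\mathbf{U}}\widetilde{\mathbf{U}}^\top)^{-1} = \mathbf{I}_n - \widetilde{\mathbf{U}} \left(\mathbf{B}^{-1} + \mathbf{r}_+\mathbf{r}_+^\top - \mathbf{r}_-\mathbf{r}_-^\top\right)^{-1} \widetilde{\mathbf{U}}^\top.$$

(c) Suppose that we have stored $\mathbf{B}$ in computer memory. Use Algorithm 6.8.1 and parts (a) and (b) to write pseudo-code that updates $(\mathbf{I}_n + \mathbf{U}\mathbf{U}^\top)^{-1}$ to $(\mathbf{I}_n + \widetilde{\mathbf{U}}\widetilde{\mathbf{U}}^\top)^{-1}$ in $\mathcal{O}((n + h)h)$ computing time.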


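8. Equation (9.7) gives the rank-two BFGS update of the inverse Hessian $\mathbf{C}_t$ to $\mathbf{C}_{t+1}$. Instead of using a two-rank update, we can consider a one-rank update, in which $\mathbf{C}_t$ is updated to $\mathbf{C}_{t+1}$ by the general rank-one formula:

$$\mathbf{C}_{t+1} = \mathbf{C}_t + \upsilon_t\, \mathbf{r}_t \mathbf{r}_t^\top.$$

Find values for the scalar $\upsilon_t$ and vector $\mathbf{r}_t$, such that $\mathbf{C}_{t+1}$ satisfies the secant condition $\mathbf{C}_{t+1}\mathbf{g}_t = \boldsymbol{\delta}_t$.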


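9. Show that the BFGS formula (B.23) can be written as:

$$\mathbf{C} \leftarrow \left(\mathbf{I} - \upsilon\, \mathbf{g}\boldsymbol{\delta}^\top\right)^\top \mathbf{C} \left(\mathbf{I} - \upsilon\, \mathbf{g}\boldsymbol{\delta}^\top\right) + \upsilon\, \boldsymbol{\delta}\boldsymbol{\delta}^\top,$$

where $\upsilon := (\mathbf{g}^\top\boldsymbol{\delta})^{-1}$.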




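10. Show that the BFGS formula (B.23) is the solution to the constrained optimization problem:

$$\mathbf{C}_{\mathrm{BFGS}} = \mathop{\mathrm{argmin}}_{\mathbf{A} \text{ subject to } \mathbf{A}\mathbf{g} = \boldsymbol{\delta},\ \mathbf{A} = \mathbf{A}^\top} \mathcal{D}(\mathbf{0}, \mathbf{C} \mid \mathbf{0}, \mathbf{A}),$$

where $\mathcal{D}$ is the Kullback–Leibler discrepancy defined in (B.22). On the other hand, show that the DFP formula (B.24) is the solution to the constrained optimization problem:

$$\mathbf{C}_{\mathrm{DFP}} = \mathop{\mathrm{argmin}}_{\mathbf{A} \text{ subject to } \mathbf{A}\mathbf{g} = \boldsymbol{\delta},\ \mathbf{A} = \mathbf{A}^\top} \mathcal{D}(\mathbf{0}, \mathbf{A} \mid \mathbf{0}, \mathbf{C}).$$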


11. Consider again the logistic regression model in Exercise 5.18, which used iterative 
reweighted least squares for training the learner. Repeat all the computations, but this 
time using the limited-memory BFGS Algorithm 9.4.4. Which training algorithm converges 
faster to the optimal solution? 


12. Download the seeds_dataset.txt data set from the book’s GitHub site, which contains 210 independent examples. The categorical output (response) here is the type of wheat grain: Kama, Rosa, and Canadian (encoded as 1, 2, and 3), so that c = 3. The seven continuous features (explanatory variables) are measurements of the geometrical properties of the grain (area, perimeter, compactness, length, width, asymmetry coefficient, and length of kernel groove). Thus, $\mathbf{x} \in \mathbb{R}^7$ (which does not include the constant feature 1) and the multi-logit pre-classifier in Example 9.2 can be written as $g(\mathbf{x}) = \mathrm{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b})$, where $\mathbf{W} \in \mathbb{R}^{3 \times 7}$ and $\mathbf{b} \in \mathbb{R}^3$. Implement and train this pre-classifier on the first n = 105 examples of the seeds data set using, for example, Algorithm 9.4.1. Use the remaining n′ = 105 examples in the data set to estimate the generalization risk of the learner using the cross-entropy loss. [Hint: Use the cross-entropy loss formulas from Example 9.4.]


13. In Exercise 12 above, we train the multi-logit classifier using a weight matrix $\mathbf{W} \in \mathbb{R}^{3 \times 7}$ and bias vector $\mathbf{b} \in \mathbb{R}^3$. Repeat the training of the multi-logit model, but this time keeping $z_1$ as an arbitrary constant (say $z_1 = 0$), and thus setting c = 0 to be a “reference” class. This has the effect of removing a node from the output layer of the network, giving a weight matrix $\mathbf{W} \in \mathbb{R}^{2 \times 7}$ and bias vector $\mathbf{b} \in \mathbb{R}^2$ of smaller dimensions than in (7.16).


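14. Consider again Example 9.4, where we used a softmax output function $\mathbf{S}_L$ in conjunction with the cross-entropy loss: $C(\boldsymbol{\theta}) = -\ln g_{y+1}(\mathbf{x} \mid \boldsymbol{\theta})$. Find formulas for $\frac{\partial C}{\partial \mathbf{g}}$ and $\frac{\partial \mathbf{S}_L}{\partial \mathbf{z}_L}$. Hence, verify that:

$$\frac{\partial \mathbf{S}_L}{\partial \mathbf{z}_L} \frac{\partial C}{\partial \mathbf{g}} = \mathbf{g}(\mathbf{x} \mid \boldsymbol{\theta}) - \mathbf{e}_{y+1},$$

where $\mathbf{e}_i$ is the unit length vector with an entry of 1 in the i-th position.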


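15. Derive the formula (B.25) for a diagonal Hessian update in a quasi-Newton method for minimization. In other words, given a current minimizer $\mathbf{x}_t$ of $f(\mathbf{x})$, a diagonal matrix $\mathbf{C}$ approximating the Hessian of $f$, and a gradient vector $\mathbf{u} = \nabla f(\mathbf{x}_t)$, find the solution to the constrained optimization program:

$$\min_{\mathbf{A}}\; \mathcal{D}(\mathbf{x}_t, \mathbf{C} \mid \mathbf{x}_t - \mathbf{A}\mathbf{u}, \mathbf{A})$$
$$\text{subject to: } \mathbf{A}\mathbf{g} \geqslant \boldsymbol{\delta},\ \mathbf{A} \text{ is diagonal},$$

where $\mathcal{D}$ is the Kullback–Leibler distance defined in (B.22) (see Exercise 4).
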
16. Consider again the Python implementation of the polynomial regression in Section 9.5.1, where the stochastic gradient descent was used for training.

Using the polynomial regression data set, implement and run the following four alternative training methods:

(a) the steepest-descent Algorithm 9.4.1;

(b) the Levenberg–Marquardt Algorithm B.3.3, in conjunction with Algorithm 9.4.2 for computing the matrix of Jacobi;

(c) the limited-memory BFGS Algorithm 9.4.4;

(d) the Adam Algorithm 9.4.5, which uses past gradient values to determine the next search direction.

For each training algorithm, using trial and error, tune any algorithmic parameters so that the network training is as fast as possible. Comment on the relative advantages and disadvantages of each training/optimization method. For example, comment on which optimization method makes rapid initial progress, but gets trapped in a suboptimal solution, and which method is slower, but more consistent in finding good optima.


17. Consider again the Pytorch code in Section 9.5.2. Repeat all the computations, but 
this time using the momentum method for training of the network. Comment on which 
method is preferable: the momentum or the Adam method? 


APPENDIX A

LINEAR ALGEBRA AND FUNCTIONAL ANALYSIS

The purpose of this appendix is to review some important topics in linear algebra and functional analysis. We assume that the reader has some familiarity with matrix and vector operations, including matrix multiplication and the computation of determinants.


A.1 Vector Spaces, Bases, and Matrices 


Linear algebra is the study of vector spaces and linear mappings. Vectors are, by definition, elements of some vector space $\mathcal{V}$ and satisfy the usual rules of addition and scalar multiplication, e.g.,

$$\text{if } \mathbf{x} \in \mathcal{V} \text{ and } \mathbf{y} \in \mathcal{V}, \text{ then } \alpha\mathbf{x} + \beta\mathbf{y} \in \mathcal{V} \text{ for all } \alpha, \beta \in \mathbb{R} \text{ (or } \mathbb{C}).$$

We will be dealing mostly with vectors in the Euclidean vector space $\mathbb{R}^n$ for some n. That is, we view the points of $\mathbb{R}^n$ as objects that can be added up and multiplied with a scalar, e.g., $(x_1, x_2) + (y_1, y_2) = (x_1 + y_1, x_2 + y_2)$ for points in $\mathbb{R}^2$. Sometimes it is convenient to work with the complex vector space $\mathbb{C}^n$ instead of $\mathbb{R}^n$; see also Section A.3.

Vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are called linearly independent if none of them can be expressed as a linear combination of the others; that is, if $\alpha_1\mathbf{v}_1 + \cdots + \alpha_n\mathbf{v}_n = \mathbf{0}$, then it must hold that $\alpha_i = 0$ for all $i = 1, \ldots, n$.


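Definition A.1: Basis of a Vector Space

A set of vectors $\mathcal{B} = \{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is called a basis of the vector space $\mathcal{V}$ if every vector $\mathbf{x} \in \mathcal{V}$ can be written as a unique linear combination of the vectors in $\mathcal{B}$:

$$\mathbf{x} = \alpha_1\mathbf{v}_1 + \cdots + \alpha_n\mathbf{v}_n.$$

The (possibly infinite) number n is called the dimension of $\mathcal{V}$.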
Using a basis $\mathcal{B}$ of $\mathcal{V}$, we can thus represent each vector $\mathbf{x} \in \mathcal{V}$ as a row or column of numbers

$$[\alpha_1, \ldots, \alpha_n] \quad \text{or} \quad \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}. \tag{A.1}$$

Typically, vectors in $\mathbb{R}^n$ are represented via the standard basis, consisting of unit vectors (points) $\mathbf{e}_1 = (1, 0, \ldots, 0), \ldots, \mathbf{e}_n = (0, 0, \ldots, 0, 1)$. As a consequence, any point $(x_1, \ldots, x_n) \in \mathbb{R}^n$ can be represented, using the standard basis, as a row or column vector of the form (A.1) above, with $\alpha_i = x_i$, $i = 1, \ldots, n$. We will also write $[x_1, x_2, \ldots, x_n]^\top$ for the corresponding column vector, where $\top$ denotes the transpose.

To avoid confusion, we will use the convention from now on that a generic vector $\mathbf{x}$ is always represented via the standard basis as a column vector. The corresponding row vector is denoted by $\mathbf{x}^\top$.





A matrix can be viewed as an array of m rows and n columns that defines a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ (or for complex matrices, from $\mathbb{C}^n$ to $\mathbb{C}^m$). The matrix is said to be square if m = n. If $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ are the columns of $\mathbf{A}$, that is, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$, and if $\mathbf{x} = [x_1, \ldots, x_n]^\top$, then $\mathbf{A}\mathbf{x} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n$. In particular, the standard basis vector $\mathbf{e}_k$ is mapped to the vector $\mathbf{a}_k$, $k = 1, \ldots, n$. We sometimes use the notation $\mathbf{A} = [a_{ij}]$, to denote a matrix whose (i, j)-th element is $a_{ij}$. When we wish to emphasize that a matrix $\mathbf{A}$ is real-valued with m rows and n columns, we write $\mathbf{A} \in \mathbb{R}^{m \times n}$. The rank of a matrix is the number of linearly independent rows or, equivalently, the number of linearly independent columns.


Example A.1 (Linear Transformation) Take the matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ -0.5 & -2 \end{bmatrix}.$$

It transforms the two basis vectors $[1, 0]^\top$ and $[0, 1]^\top$, shown in red and blue in the left panel of Figure A.1, to the vectors $[1, -0.5]^\top$ and $[1, -2]^\top$, shown on the right panel. Similarly, the points on the unit circle are transformed to an ellipse.


Figure A.1: A linear transformation of the unit circle.
Suppose $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_n]$, where the $\mathcal{A} = \{\mathbf{a}_i\}$ form a basis of $\mathbb{R}^n$. Take any vector $\mathbf{x} = [x_1, \ldots, x_n]_{\mathcal{E}}^\top$ with respect to the standard basis $\mathcal{E}$ (we write subscript $\mathcal{E}$ to stress this). Then the representation of this vector with respect to $\mathcal{A}$ is simply

$$\mathbf{y} = \mathbf{A}^{-1}\mathbf{x},$$

where $\mathbf{A}^{-1}$ is the inverse of $\mathbf{A}$; that is, the matrix such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n$, where $\mathbf{I}_n$ is the n-dimensional identity matrix. To see this, note that $\mathbf{A}^{-1}\mathbf{a}_i$ gives the i-th unit vector representation, for $i = 1, \ldots, n$, and recall that each vector in $\mathbb{R}^n$ is a unique linear combination of these basis vectors.


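Example A.2 (Basis Representation) Consider the matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \quad \text{with inverse} \quad \mathbf{A}^{-1} = \begin{bmatrix} -2 & 1 \\ 3/2 & -1/2 \end{bmatrix}. \tag{A.2}$$

The vector $\mathbf{x} = [1, 1]_{\mathcal{E}}^\top$ in the standard basis has representation $\mathbf{y} = \mathbf{A}^{-1}\mathbf{x} = [-1, 1]_{\mathcal{A}}^\top$ in the basis consisting of the columns of $\mathbf{A}$. Namely,

$$\mathbf{A}\mathbf{y} = -1 \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 1 \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$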
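
The basis representation in Example A.2 can be reproduced numerically. The sketch below solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not something prescribed by the text.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # columns of A form the new basis
x = np.array([1.0, 1.0])                 # coordinates in the standard basis

y = np.linalg.solve(A, x)                # coordinates of x in the basis of columns of A
print(y)                                 # approximately [-1, 1]
print(np.allclose(A @ y, x))             # True: -1 * [1, 3] + 1 * [2, 4] = [1, 1]
```
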








The transpose of a matrix $\mathbf{A} = [a_{ij}]$ is the matrix $\mathbf{A}^\top = [a_{ji}]$; that is, the (i, j)-th element of $\mathbf{A}^\top$ is the (j, i)-th element of $\mathbf{A}$. The trace of a square matrix is the sum of its diagonal elements. A useful result is the following cyclic property.


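Theorem A.1: Cyclic Property

The trace is invariant under cyclic permutations: $\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \mathrm{tr}(\mathbf{C}\mathbf{A}\mathbf{B})$.
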
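Proof: It suffices to show that $\mathrm{tr}(\mathbf{D}\mathbf{E})$ is equal to $\mathrm{tr}(\mathbf{E}\mathbf{D})$ for any m × n matrix $\mathbf{D} = [d_{ij}]$ and n × m matrix $\mathbf{E} = [e_{ij}]$. The diagonal elements of $\mathbf{D}\mathbf{E}$ are $\sum_{j=1}^n d_{ij} e_{ji}$, $i = 1, \ldots, m$ and the diagonal elements of $\mathbf{E}\mathbf{D}$ are $\sum_{i=1}^m e_{ji} d_{ij}$, $j = 1, \ldots, n$. They sum up to the same number $\sum_{i=1}^m \sum_{j=1}^n d_{ij} e_{ji}$. □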
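
A small numerical check of the identity used in the proof, tr(DE) = tr(ED), for non-square factors; the matrix sizes and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((3, 5))          # m x n
E = rng.standard_normal((5, 3))          # n x m

# DE is 3 x 3 and ED is 5 x 5, yet their traces coincide.
print(np.isclose(np.trace(D @ E), np.trace(E @ D)))   # True
```
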


A square matrix has an inverse if and only if its columns (or rows) are linearly independent. This is the same as the matrix being of full rank; that is, its rank is equal to the number of columns. An equivalent statement is that its determinant is not zero. The determinant of an n × n matrix $\mathbf{A} = [a_{i,j}]$ is defined as

$$\det(\mathbf{A}) := \sum_{\pi} (-1)^{\zeta(\pi)} \prod_{i=1}^n a_{\pi_i, i}, \tag{A.3}$$

where the sum is over all permutations $\pi = (\pi_1, \ldots, \pi_n)$ of $(1, \ldots, n)$, and $\zeta(\pi)$ is the number of pairs (i, j) for which $i < j$ and $\pi_i > \pi_j$. For example, $\zeta(2, 3, 4, 1) = 3$ for the pairs (1, 4), (2, 4), (3, 4). The determinant of a diagonal matrix (a matrix with only zero elements off the diagonal) is simply the product of its diagonal elements.
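
Definition (A.3) translates directly into code. The sketch below is meant only to make the formula concrete: it sums over all n! permutations and is therefore impractical beyond very small n. The function names are hypothetical, and indices are 0-based, which does not affect the inversion count.

```python
from itertools import permutations

def inversions(perm):
    """Number of pairs i < j with perm[i] > perm[j], i.e., zeta(pi) in (A.3)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

def det_by_permutations(A):
    """Determinant via the sum over permutations in (A.3)."""
    n = len(A)
    total = 0.0
    for perm in permutations(range(n)):
        term = (-1) ** inversions(perm)
        for i in range(n):
            term *= A[perm[i]][i]        # element a_{pi_i, i}
        total += term
    return total

print(inversions((2, 3, 4, 1)))                # 3
print(det_by_permutations([[1, 2], [3, 4]]))   # -2.0
```
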


Geometrically, the determinant of a square matrix $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_n]$ is the (signed) volume of the parallelepiped (n-dimensional parallelogram) defined by the columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$; that is, the set of points $\mathbf{x} = \sum_{i=1}^n \alpha_i \mathbf{a}_i$, where $0 \leqslant \alpha_i \leqslant 1$, $i = 1, \ldots, n$.

The easiest way to compute a determinant of a general matrix is to apply simple operations to the matrix that potentially reduce its complexity (as in the number of non-zero elements, for example), while retaining its determinant:

• Adding a multiple of one column (or row) to another does not change the determinant.

• Multiplying a column (or row) with a number multiplies the determinant by the same number.

• Swapping two rows changes the sign of the determinant.

By applying these rules repeatedly one can reduce any matrix to a diagonal matrix. It follows then that the determinant of the original matrix is equal to the product of the diagonal elements of the resulting diagonal matrix multiplied by a known constant.


Example A.3 (Determinant and Volume) Figure A.2 illustrates how the determinant of a matrix can be viewed as a signed volume, which can be computed by repeatedly applying the first rule above. Here, we wish to compute the area of the red parallelogram determined by the matrix $\mathbf{A}$ given in (A.2). In particular, the corner points of the parallelogram correspond to the vectors $[0, 0]^\top$, $[1, 3]^\top$, $[2, 4]^\top$, and $[3, 7]^\top$.







Figure A.2: The volume of the red parallelogram can be obtained by a number of shear 
operations that do not change the volume. 


Adding $-2$ times the first column of $A$ to the second column gives the matrix

$$B = \begin{bmatrix} 1 & 0 \\ 3 & -2 \end{bmatrix},$$

corresponding to the blue parallelogram. The linear operation that transforms the red to the
blue parallelogram can be thought of as a succession of two linear transformations. The
first is to transform the coordinates of points on the red parallelogram (in standard basis)
to the basis formed by the columns of $A$. Second, relative to this new basis, we apply the
matrix $B$ above. Note that the input of this matrix is with respect to the new basis, whereas
the output is with respect to the standard basis. The matrix for the combined operation is
now

$$BA^{-1} = \begin{bmatrix} 1 & 0 \\ 3 & -2 \end{bmatrix} \begin{bmatrix} -2 & 1 \\ 3/2 & -1/2 \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -9 & 4 \end{bmatrix},$$


which maps $[1, 3]^\top$ to $[1, 3]^\top$ (does not change) and $[2, 4]^\top$ to $[0, -2]^\top$. We say that we
apply a shear in the direction $[1, 3]^\top$. The significance of such an operation is that a shear
does not alter the volume of the parallelogram. The second (blue) parallelogram has an
easier form, because one of the sides is parallel to the $y$-axis. By applying another shear,
in the direction $[0, -2]^\top$, we can obtain a simple (green) rectangle, whose volume is 2. In
matrix terms, we add $3/2$ times the second column of $B$ to the first column of $B$, to obtain
the matrix

$$C = \begin{bmatrix} 1 & 0 \\ 0 & -2 \end{bmatrix},$$

which is a diagonal matrix, whose determinant is $-2$, corresponding to the volume 2 of all
the parallelograms. ■
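The two shears of this example can be checked with a small Python sketch. The column operations are written as right-multiplications by elementary matrices; the names `E1`, `E2`, `matmul`, and `det2` are our own and serve only this illustration.

```python
from fractions import Fraction as F


def matmul(X, Y):
    """Product of two 2 x 2 matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


A = [[F(1), F(2)], [F(3), F(4)]]
E1 = [[F(1), F(-2)], [F(0), F(1)]]     # second column += -2 * first column
E2 = [[F(1), F(0)], [F(3, 2), F(1)]]   # first column += 3/2 * second column

B = matmul(A, E1)   # blue parallelogram
C = matmul(B, E2)   # green rectangle

A_inv = [[F(-2), F(1)], [F(3, 2), F(-1, 2)]]
shear = matmul(B, A_inv)   # the combined operation B A^{-1}

print(B, C, shear)
print(det2(A), det2(B), det2(C), det2(shear))
```

All three matrices have determinant $-2$, while the shear itself has determinant 1, which is why it preserves the volume.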


Theorem A.2 summarizes a number of useful matrix rules for the concepts that we have
discussed so far. We leave the proofs, which typically involve “writing out” the equations,
as an exercise for the reader; see also [116].


Theorem A.2: Useful Matrix Rules 





Next, consider an $n \times p$ matrix $A$ for which the matrix inverse fails to exist. That is, $A$
is either non-square ($n \neq p$) or its determinant is 0. Instead of the inverse, we can use its
so-called pseudo-inverse, which always exists.




Definition A.2: Moore–Penrose Pseudo-Inverse

The Moore–Penrose pseudo-inverse of a real matrix $A \in \mathbb{R}^{n \times p}$ is defined as the
unique matrix $A^+ \in \mathbb{R}^{p \times n}$ that satisfies the conditions:

1. $A A^+ A = A$
2. $A^+ A A^+ = A^+$
3. $(A A^+)^\top = A A^+$
4. $(A^+ A)^\top = A^+ A$





We can write $A^+$ explicitly in terms of $A$ when $A$ has a full column or row rank. For
example, we always have

$$A^\top A A^+ = A^\top (A A^+)^\top = ((A A^+) A)^\top = (A)^\top = A^\top. \tag{A.4}$$

If $A$ has a full column rank $p$, then $(A^\top A)^{-1}$ exists, so that from (A.4) it follows that
$A^+ = (A^\top A)^{-1} A^\top$. This is referred to as the left pseudo-inverse, as $A^+ A = I_p$. Similarly, if
$A$ has a full row rank $n$, that is, $(A A^\top)^{-1}$ exists, then it follows from

$$A^+ A A^\top = (A^+ A)^\top A^\top = (A (A^+ A))^\top = A^\top$$

that $A^+ = A^\top (A A^\top)^{-1}$. This is the right pseudo-inverse, as $A A^+ = I_n$. Finally, if $A$ is of full
rank and square, then $A^+ = A^{-1}$.
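As a numerical illustration, the following sketch checks the four conditions of Definition A.2 and the left and right pseudo-inverse formulas with NumPy. The random test matrix and the variable names are our own assumptions; `np.linalg.pinv` is NumPy's pseudo-inverse routine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall matrix; full column rank with probability 1
A_plus = np.linalg.pinv(A)

# The four defining conditions of the Moore-Penrose pseudo-inverse.
cond1 = np.allclose(A @ A_plus @ A, A)
cond2 = np.allclose(A_plus @ A @ A_plus, A_plus)
cond3 = np.allclose((A @ A_plus).T, A @ A_plus)
cond4 = np.allclose((A_plus @ A).T, A_plus @ A)

# Left pseudo-inverse (full column rank): (A^T A)^{-1} A^T.
left = np.linalg.inv(A.T @ A) @ A.T
left_ok = np.allclose(left, A_plus) and np.allclose(left @ A, np.eye(3))

# Right pseudo-inverse (full row rank), illustrated with the wide matrix W = A^T.
W = A.T
right = W.T @ np.linalg.inv(W @ W.T)
right_ok = np.allclose(right, np.linalg.pinv(W)) and np.allclose(W @ right, np.eye(3))

print(cond1, cond2, cond3, cond4, left_ok, right_ok)
```

In practice one would rarely form $(A^\top A)^{-1}$ explicitly; it appears here only to mirror the formulas in the text.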


A.2 Inner Product 


The (Euclidean) inner product of two real vectors $x = [x_1, \ldots, x_n]^\top$ and $y = [y_1, \ldots, y_n]^\top$
is defined as the number

$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i = x^\top y.$$

Here $x^\top y$ is the matrix multiplication of the $(1 \times n)$ matrix $x^\top$ and the $(n \times 1)$ matrix $y$.
The inner product induces a geometry on the linear space $\mathbb{R}^n$, allowing for the definition of
length, angle, and so on. The inner product satisfies the following properties:

1. $\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle$;
2. $\langle x, y \rangle = \langle y, x \rangle$;
3. $\langle x, x \rangle \ge 0$;
4. $\langle x, x \rangle = 0$ if and only if $x = 0$.

Vectors $x$ and $y$ are called perpendicular (or orthogonal) if $\langle x, y \rangle = 0$. The Euclidean
norm (or length) of a vector $x$ is defined as

$$\|x\| = \sqrt{x_1^2 + \cdots + x_n^2} = \sqrt{\langle x, x \rangle}.$$
If $x$ and $y$ are perpendicular, then Pythagoras' theorem holds:

$$\|x + y\|^2 = \langle x + y, x + y \rangle = \langle x, x \rangle + 2 \langle x, y \rangle + \langle y, y \rangle = \|x\|^2 + \|y\|^2. \tag{A.5}$$

A basis $\{v_1, \ldots, v_n\}$ of $\mathbb{R}^n$ in which all the vectors are pairwise perpendicular and have
norm 1 is called an orthonormal (short for orthogonal and normalized) basis. For example,
the standard basis is orthonormal.


Theorem A.3: Orthonormal Basis Representation 























Proof: Observe that, because the $\{v_i\}$ form a basis, there exist unique $\alpha_1, \ldots, \alpha_n$ such that
$x = \alpha_1 v_1 + \cdots + \alpha_n v_n$. By the linearity of the inner product and the orthonormality of the
$\{v_i\}$ it follows that $\langle x, v_j \rangle = \langle \sum_i \alpha_i v_i, v_j \rangle = \alpha_j$. □
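An $n \times n$ matrix $V$ whose columns form an orthonormal basis is called an orthogonal
matrix.¹ Note that for an orthogonal matrix $V = [v_1, \ldots, v_n]$, we have

$$V^\top V = \begin{bmatrix} v_1^\top \\ v_2^\top \\ \vdots \\ v_n^\top \end{bmatrix} [v_1, v_2, \ldots, v_n] = \begin{bmatrix} v_1^\top v_1 & v_1^\top v_2 & \cdots & v_1^\top v_n \\ v_2^\top v_1 & v_2^\top v_2 & \cdots & v_2^\top v_n \\ \vdots & \vdots & \ddots & \vdots \\ v_n^\top v_1 & v_n^\top v_2 & \cdots & v_n^\top v_n \end{bmatrix} = I_n.$$

Hence, $V^{-1} = V^\top$. Note also that an orthogonal transformation is length preserving; that
is, $Vx$ has the same length as $x$. This follows from

$$\|Vx\|^2 = \langle Vx, Vx \rangle = x^\top V^\top V x = x^\top x = \|x\|^2.$$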


A.3 Complex Vectors and Matrices 


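Instead of the vector space $\mathbb{R}^n$ of $n$-dimensional real vectors, it is sometimes useful to
consider the vector space $\mathbb{C}^n$ of $n$-dimensional complex vectors. In this case the adjoint
or conjugate transpose operation ($*$) replaces the transpose operation ($\top$). This involves
the usual transposition of the matrix or vector with the additional step that any complex
number $z = x + \mathrm{i} y$ is replaced by its complex conjugate $\bar{z} = x - \mathrm{i} y$. For example, if

$$x = \begin{bmatrix} a_1 + \mathrm{i} b_1 \\ a_2 + \mathrm{i} b_2 \end{bmatrix} \quad \text{and} \quad A = \begin{bmatrix} a_{11} + \mathrm{i} b_{11} & a_{12} + \mathrm{i} b_{12} \\ a_{21} + \mathrm{i} b_{21} & a_{22} + \mathrm{i} b_{22} \end{bmatrix},$$

then

$$x^* = [a_1 - \mathrm{i} b_1, \; a_2 - \mathrm{i} b_2] \quad \text{and} \quad A^* = \begin{bmatrix} a_{11} - \mathrm{i} b_{11} & a_{21} - \mathrm{i} b_{21} \\ a_{12} - \mathrm{i} b_{12} & a_{22} - \mathrm{i} b_{22} \end{bmatrix}.$$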





¹ The qualifier “orthogonal” for such matrices has been fixed by history. A better term would have been
“orthonormal”.
The (Euclidean) inner product of $x$ and $y$ (viewed as column vectors) is now defined as

$$\langle x, y \rangle = y^* x = \sum_{i=1}^{n} x_i \, \overline{y_i},$$

which is no longer symmetric: $\langle x, y \rangle = \overline{\langle y, x \rangle}$. Note that this generalizes the real-valued
inner product. The determinant of a complex matrix $A$ is defined exactly as in (A.3). As a
consequence, $\det(A^*) = \overline{\det(A)}$.

A complex matrix is said to be Hermitian or self-adjoint if $A^* = A$, and unitary if
$A^* A = I$ (that is, if $A^* = A^{-1}$). For real matrices “Hermitian” is the same as “symmetric”,
and “unitary” is the same as “orthogonal”.


A.4 Orthogonal Projections 


Let $\{u_1, \ldots, u_k\}$ be a set of linearly independent vectors in $\mathbb{R}^n$. The set

$$\mathcal{V} = \mathrm{Span}\{u_1, \ldots, u_k\} = \{\alpha_1 u_1 + \cdots + \alpha_k u_k, \; \alpha_1, \ldots, \alpha_k \in \mathbb{R}\}$$

is called the linear subspace spanned by $\{u_1, \ldots, u_k\}$. The orthogonal complement of $\mathcal{V}$,
denoted by $\mathcal{V}^\perp$, is the set of all vectors $w$ that are orthogonal to $\mathcal{V}$, in the sense that
$\langle w, v \rangle = 0$ for all $v \in \mathcal{V}$. The matrix $P$ such that $Px = x$, for all $x \in \mathcal{V}$, and $Px = 0$, for all
$x \in \mathcal{V}^\perp$, is called the orthogonal projection matrix onto $\mathcal{V}$. Suppose that $U = [u_1, \ldots, u_k]$
has full rank, in which case $U^\top U$ is an invertible matrix. The orthogonal projection matrix
$P$ onto $\mathcal{V} = \mathrm{Span}\{u_1, \ldots, u_k\}$ is then given by

$$P = U (U^\top U)^{-1} U^\top.$$

Namely, since $PU = U$, the matrix $P$ projects any vector in $\mathcal{V}$ onto itself. Moreover, $P$
projects any vector in $\mathcal{V}^\perp$ onto the zero vector. Using the pseudo-inverse, it is possible to
specify the projection matrix also for the case where $U$ is not of full rank, leading to the
following theorem.


Theorem A.4: Orthogonal Projection 





Proof: By Property 1 of Definition A.2 we have $PU = U U^+ U = U$, so that $P$ projects any
vector in $\mathcal{V}$ onto itself. Moreover, $P$ projects any vector in $\mathcal{V}^\perp$ onto the zero vector. □


Note that in the special case where $u_1, \ldots, u_k$ above form an orthonormal basis of $\mathcal{V}$,
the projection onto $\mathcal{V}$ is very simple to describe, namely we have

$$Px = U U^\top x = \sum_{i=1}^{k} \langle x, u_i \rangle \, u_i. \tag{A.8}$$







For any point $x \in \mathbb{R}^n$, the point in $\mathcal{V}$ that is closest to $x$ is its orthogonal projection $Px$,
as the following theorem shows.


Theorem A.5: Orthogonal Projection and Minimal Distance 





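Proof: We can write each point $y \in \mathcal{V}$ as $y = \sum_{i=1}^{k} \alpha_i u_i$. Consequently,

$$\|x - y\|^2 = \Big\langle x - \sum_{i=1}^{k} \alpha_i u_i, \; x - \sum_{i=1}^{k} \alpha_i u_i \Big\rangle = \|x\|^2 - 2 \sum_{i=1}^{k} \alpha_i \langle x, u_i \rangle + \sum_{i=1}^{k} \alpha_i^2.$$

Minimizing this with respect to the $\{\alpha_i\}$ gives $\alpha_i = \langle x, u_i \rangle$, $i = 1, \ldots, k$. In view of (A.8),
the optimal $y$ is thus $Px$. □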
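The projection formula $P = U (U^\top U)^{-1} U^\top$ and the minimal-distance property can be illustrated numerically. The sketch below uses NumPy with a randomly generated $U$ (our own test setup) and compares the distance from $x$ to $Px$ with the distance to a sample of other points in the span.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 2))         # two (almost surely independent) vectors in R^4
P = U @ np.linalg.inv(U.T @ U) @ U.T    # projection onto the span of the columns of U

x = rng.standard_normal(4)
Px = P @ x
residual = x - Px

fixes_span = np.allclose(P @ U, U)                      # vectors in the span are unchanged
residual_orthogonal = np.allclose(U.T @ residual, 0.0)  # x - Px is orthogonal to the span
idempotent = np.allclose(P @ P, P)                      # projecting twice changes nothing

# Px should be at least as close to x as any other point U @ alpha of the span.
others = [U @ rng.standard_normal(2) for _ in range(1000)]
closest = all(np.linalg.norm(x - Px) <= np.linalg.norm(x - y) + 1e-12 for y in others)

print(fixes_span, residual_orthogonal, idempotent, closest)
```

The random sample cannot prove minimality, of course; it only shows that no sampled point of the span beats $Px$.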


A.5 Eigenvalues and Eigenvectors 


Let $A$ be an $n \times n$ matrix. If $Av = \lambda v$ for some number $\lambda$ and non-zero vector $v$, then $\lambda$ is
called an eigenvalue of $A$ with eigenvector $v$.

If $(\lambda, v)$ is an (eigenvalue, eigenvector) pair, the matrix $\lambda I - A$ maps any multiple of $v$
to the zero vector. Consequently, the columns of $\lambda I - A$ are linearly dependent, and hence
its determinant is 0. This provides a way to identify the eigenvalues, namely as the $r \le n$
different roots $\lambda_1, \ldots, \lambda_r$ of the characteristic polynomial

$$\det(\lambda I - A) = (\lambda - \lambda_1)^{\alpha_1} \cdots (\lambda - \lambda_r)^{\alpha_r},$$

where $\alpha_1 + \cdots + \alpha_r = n$. The integer $\alpha_i$ is called the algebraic multiplicity of $\lambda_i$. The
eigenvectors that correspond to an eigenvalue $\lambda_i$ lie in the kernel or null space of the matrix
$\lambda_i I - A$; that is, the linear space of vectors $v$ such that $(\lambda_i I - A) v = 0$. This space is called
the eigenspace of $\lambda_i$. Its dimension, $d_i \in \{1, \ldots, n\}$, is called the geometric multiplicity of
$\lambda_i$. It always holds that $d_i \le \alpha_i$. If $\sum_i d_i = n$, then we can construct a basis for $\mathbb{R}^n$ consisting
of eigenvectors, as illustrated next.


Example A.4 (Linear Transformation (cont.)) We revisit the linear transformation in
Figure A.1, where

$$A = \begin{bmatrix} 1 & 1 \\ -1/2 & -2 \end{bmatrix}.$$

The characteristic polynomial is $(\lambda - 1)(\lambda + 2) + 1/2$, with roots $\lambda_1 = -1/2 - \sqrt{7}/2 \approx -1.8229$
and $\lambda_2 = -1/2 + \sqrt{7}/2 \approx 0.8229$. The corresponding unit eigenvectors are
$v_1 \approx [0.3339, -0.9426]^\top$ and $v_2 \approx [0.9847, -0.1744]^\top$. The eigenspace corresponding to $\lambda_1$ is
$\mathcal{V}_1 = \mathrm{Span}\{v_1\} = \{\beta v_1 : \beta \in \mathbb{R}\}$ and the eigenspace corresponding to $\lambda_2$ is $\mathcal{V}_2 = \mathrm{Span}\{v_2\}$.
The algebraic and geometric multiplicities are 1 in this case. Any pair of non-zero vectors, one taken
from $\mathcal{V}_1$ and one from $\mathcal{V}_2$, forms a basis for $\mathbb{R}^2$. Figure A.3 shows how $v_1$ and $v_2$ are transformed to
$A v_1 \in \mathcal{V}_1$ and $A v_2 \in \mathcal{V}_2$, respectively.











Figure A.3: The dashed arrows are the unit eigenvectors $v_1$ (blue) and $v_2$ (red) of matrix $A$.
Their transformed values $A v_1$ and $A v_2$ are indicated by solid arrows. ■


A matrix for which the algebraic and geometric multiplicities of all its eigenvalues
are the same is called semi-simple. This is equivalent to the matrix being diagonalizable,
meaning that there is a matrix $V$ and a diagonal matrix $D$ such that

$$A = V D V^{-1}.$$

To see that this so-called eigen-decomposition holds, suppose $A$ is a semi-simple matrix
with eigenvalues $\lambda_1, \ldots, \lambda_r$, each repeated according to its multiplicity.
Let $D$ be the diagonal matrix whose diagonal elements are the eigenvalues of $A$, and let $V$
be a matrix whose columns are linearly independent eigenvectors corresponding to these
eigenvalues. Then, for each (eigenvalue, eigenvector) pair $(\lambda, v)$, we have $Av = \lambda v$. Hence,
in matrix notation, we have $AV = VD$, and so $A = V D V^{-1}$.
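For the $2 \times 2$ matrix of Example A.4, the eigenvalues, unit eigenvectors, and the eigen-decomposition can be reproduced with plain Python. This is only a sketch for the $2 \times 2$ case with distinct real eigenvalues; the helper name `unit_eigenvector` is our own.

```python
import math

A = [[1.0, 1.0], [-0.5, -2.0]]

# Roots of det(lambda*I - A) = lambda^2 - tr(A)*lambda + det(A).
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr - disc) / 2, (tr + disc) / 2


def unit_eigenvector(lam):
    """The first row of (A - lam*I) v = 0 gives v proportional to [a12, lam - a11]."""
    v = [A[0][1], lam - A[0][0]]
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]


v1, v2 = unit_eigenvector(lam1), unit_eigenvector(lam2)

# Eigen-decomposition A = V D V^{-1}, with the eigenvectors as columns of V.
V = [[v1[0], v2[0]], [v1[1], v2[1]]]
dV = V[0][0] * V[1][1] - V[0][1] * V[1][0]
V_inv = [[V[1][1] / dV, -V[0][1] / dV], [-V[1][0] / dV, V[0][0] / dV]]
VD = [[V[i][0] * lam1, V[i][1] * lam2] for i in range(2)]   # V times diag(lam1, lam2)
A_rebuilt = [[sum(VD[i][k] * V_inv[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]

print(lam1, lam2)
print(v1, v2)
print(A_rebuilt)
```

For larger matrices one would rely on a numerical routine rather than on the characteristic polynomial.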


A.5.1 Left- and Right-Eigenvectors 


The eigenvector as defined in the previous section is called a right-eigenvector, as it lies on
the right of $A$ in the equation $Av = \lambda v$.

If $A$ is a complex matrix with an eigenvalue $\lambda$, then the eigenvalue's complex conjugate
$\bar{\lambda}$ is an eigenvalue of $A^*$. To see this, define $B := \lambda I - A$ and $B^* := \bar{\lambda} I - A^*$. Since $\lambda$ is
an eigenvalue, we have $\det(B) = 0$. Applying the identity $\det(B) = \overline{\det(B^*)}$, we see that
therefore $\det(B^*) = 0$, and hence that $\bar{\lambda}$ is an eigenvalue of $A^*$. Let $w$ be an eigenvector
corresponding to $\bar{\lambda}$. Then, $A^* w = \bar{\lambda} w$ or, equivalently,

$$w^* A = \lambda w^*.$$

For this reason, we call $w^*$ the left-eigenvector of $A$ for eigenvalue $\lambda$. If $v$ is a (right-) eigenvector
of $A$, then its adjoint $v^*$ is usually not a left-eigenvector, unless $A^* A = A A^*$
(such matrices are called normal; a real symmetric matrix is normal). However, the important
property holds that left- and right-eigenvectors belonging to different eigenvalues
are orthogonal. Namely, if $w^*$ is a left-eigenvector for eigenvalue $\lambda_1$ and $v$ a right-eigenvector
for eigenvalue $\lambda_2 \neq \lambda_1$, then

$$\lambda_1 w^* v = w^* A v = \lambda_2 w^* v,$$

which can only be true if $w^* v = 0$.
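The orthogonality of left- and right-eigenvectors for different eigenvalues can be observed numerically. The sketch below reuses the real matrix of Example A.4 (so that $A^* = A^\top$) and obtains the left-eigenvectors as eigenvectors of $A^\top$ via `np.linalg.eig`; the sorting step is our own device for pairing the two sets by eigenvalue.

```python
import numpy as np

A = np.array([[1.0, 1.0], [-0.5, -2.0]])

lam_r, V = np.linalg.eig(A)      # columns of V are right-eigenvectors: A v = lambda v
lam_l, W = np.linalg.eig(A.T)    # columns w of W satisfy w^T A = lambda w^T

# Sort both sets by eigenvalue so that column i of V and of W share an eigenvalue.
V = V[:, np.argsort(lam_r)]
W = W[:, np.argsort(lam_l)]
lam = np.sort(lam_r)

G = W.T @ V                      # entry (i, j) equals w_i^T v_j
left_ok = np.allclose(W.T @ A, np.diag(lam) @ W.T)

print(np.round(G, 6))            # off-diagonal entries vanish up to rounding
print(left_ok)
```

Because this $A$ is not normal, the transposed right-eigenvectors are not themselves left-eigenvectors, which is why the two calls to `eig` are needed.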


Theorem A.6: Schur Triangulation 





Proof: The proof is by induction on the dimension $n$ of the matrix. Clearly, the statement
is true for $n = 1$, as $A$ is simply a complex number and we can take $U$ equal to 1. Suppose
that the result is true for dimension $n$. We wish to show that it also holds for dimension
$n + 1$. Any matrix $A$ always has at least one eigenvalue $\lambda$ with eigenvector $v$, normalized
to have length 1. Let $U$ be any unitary matrix whose first column is $v$. Such a matrix can
always be constructed.² As $U$ is unitary, the first row of $U^{-1}$ is $v^*$, and $U^{-1} A U$ is of the form

$$\begin{bmatrix} \lambda & * \\ 0 & B \end{bmatrix},$$

for some matrix $B$. By the induction hypothesis, there exists a unitary matrix $W$ and an
upper triangular matrix $T$ such that $W^{-1} B W = T$. Now, define

$$V := \begin{bmatrix} 1 & 0^\top \\ 0 & W \end{bmatrix}.$$

Then,

$$V^{-1} (U^{-1} A U) V = \begin{bmatrix} 1 & 0^\top \\ 0 & W^{-1} \end{bmatrix} \begin{bmatrix} \lambda & * \\ 0 & B \end{bmatrix} \begin{bmatrix} 1 & 0^\top \\ 0 & W \end{bmatrix} = \begin{bmatrix} \lambda & * \\ 0 & W^{-1} B W \end{bmatrix} = \begin{bmatrix} \lambda & * \\ 0 & T \end{bmatrix},$$

which is upper triangular of dimension $n + 1$. Since $UV$ is unitary, this completes the
induction, and hence the result is true for all $n$. □








The theorem above can be used to prove an important property of Hermitian matrices,
i.e., matrices for which $A^* = A$.

² After specifying $v$ we can complete the rest of the unitary matrix via the Gram–Schmidt procedure, for
example; see Section A.6.4.

Theorem A.7: Eigenvalues of a Hermitian Matrix





Proof: Let $A$ be a Hermitian matrix. By Theorem A.6 there exists a unitary matrix $U$ such
that $U^{-1} A U = T$, where $T$ is upper triangular. It follows that the adjoint $(U^{-1} A U)^* = T^*$
is lower triangular. However, $(U^{-1} A U)^* = U^{-1} A U$, since $A^* = A$ and $U^* = U^{-1}$. Hence,
$T$ and $T^*$ must be the same, which can only be the case if $T$ is a real diagonal matrix $D$.
Since $AU = UD$, the diagonal elements are exactly the eigenvalues and the corresponding
eigenvectors are the columns of $U$. □

In particular, the eigenvalues of a real symmetric matrix are real. We can now repeat
the proof of Theorem A.6 with real eigenvalues and eigenvectors, so that there exists an
orthogonal matrix $Q$ such that $Q^{-1} A Q = Q^\top A Q = D$. The eigenvectors can be chosen as
the columns of $Q$, which form an orthonormal basis. This proves the following theorem.


Theorem A.8: Real Symmetric Matrices are Orthogonally Diagonalizable





Example A.5 (Real Symmetric Matrices and Ellipses) As we have seen, linear transformations
map circles into ellipses. We can use the above theory for real symmetric
matrices to identify the principal axes. Consider, for example, the transformation with matrix
$A = [1, 1; -1/2, -2]$ in (A.1). A point $x$ on the unit circle is mapped to a point $y = Ax$.
Since for such points $\|x\|^2 = x^\top x = 1$, we have that $y$ satisfies $y^\top (A^{-1})^\top A^{-1} y = 1$, which
gives the equation for the ellipse

$$\frac{17}{9} y_1^2 + \frac{20}{9} y_1 y_2 + \frac{8}{9} y_2^2 = 1.$$

Let $Q$ be the orthogonal matrix of eigenvectors of the symmetric matrix $(A^{-1})^\top A^{-1} =
(A A^\top)^{-1}$, so $Q^\top (A A^\top)^{-1} Q = D$ for some diagonal matrix $D$. Taking the inverse on both
sides of the previous equation, we have $Q^\top A A^\top Q = D^{-1}$, which shows that $Q$ is also the
matrix of eigenvectors of $A A^\top$. These eigenvectors point precisely in the direction of the
principal axes, as shown in Figure A.4. It turns out, see Section A.6.5, that the square roots
of the eigenvalues of $A A^\top$, here approximately 2.4221 and 0.6193, correspond to the sizes
of the principal axes of the ellipse, as illustrated in Figure A.4.

Figure A.4: The eigenvectors and eigenvalues of $A A^\top$ determine the principal axes of the
ellipse. ■


The following definition generalizes the notion of positivity of a real variable to that of a
(Hermitian) matrix, providing a crucial concept for multivariate differentiation and optimization;
see Appendix B.


Definition A.3: Positive (Semi)Definite Matrix


A Hermitian matrix $A$ is called positive semidefinite (we write $A \succeq 0$) if $\langle Ax, x \rangle \ge 0$
for all $x$. It is called positive definite (we write $A \succ 0$) if $\langle Ax, x \rangle > 0$ for all $x \neq 0$.





The positive (semi)definiteness of a matrix can be directly related to the positivity of 
its eigenvalues, as follows: 


Theorem A.9: Eigenvalues of a Positive Semidefinite Matrix 





Proof: Let $A$ be a positive semidefinite matrix. By Theorem A.7, the eigenvalues of $A$ are
all real. Suppose $\lambda$ is an eigenvalue with eigenvector $v$. As $A$ is positive semidefinite, we
have

$$0 \le \langle Av, v \rangle = \lambda \langle v, v \rangle = \lambda \|v\|^2,$$

which can only be true if $\lambda \ge 0$. Similarly, for a positive definite matrix, $\lambda$ must be strictly
greater than 0. □

Corollary A.1 Any real positive semidefinite matrix $A$ can be written as

$$A = B B^\top$$

for some real matrix $B$. Conversely, for any real matrix $B$, the matrix $B B^\top$ is positive
semidefinite.



Proof: The matrix $A$ is both Hermitian (by definition) and real (by assumption) and hence
it is symmetric. By Theorem A.8, we can write $A = Q D Q^\top$, where $D$ is the diagonal
matrix of (real) eigenvalues of $A$. By Theorem A.9 all eigenvalues are non-negative, and
thus their square root is real-valued. Now, define $B = Q \sqrt{D}$, where $\sqrt{D}$ is defined as the
diagonal matrix whose diagonal elements are the square roots of the eigenvalues of $A$.
Then, $B B^\top = Q \sqrt{D} (\sqrt{D})^\top Q^\top = Q D Q^\top = A$. The converse statement follows from the
fact that $x^\top B B^\top x = \|B^\top x\|^2 \ge 0$ for all $x$. □
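The construction $B = Q \sqrt{D}$ from the proof can be tried out numerically. The sketch below builds a positive semidefinite test matrix (our own choice), uses `np.linalg.eigh` for the orthogonal diagonalization of a symmetric matrix, and clips tiny negative eigenvalues that arise only from rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))
A = M @ M.T                          # 4 x 4, positive semidefinite, rank at most 3

eigvals, Q = np.linalg.eigh(A)       # A = Q diag(eigvals) Q^T for symmetric A
min_eig = eigvals.min()              # zero in exact arithmetic, tiny after rounding
eigvals = np.clip(eigvals, 0.0, None)
B = Q @ np.diag(np.sqrt(eigvals))    # B = Q sqrt(D)

rebuilt_ok = np.allclose(B @ B.T, A)

# Converse direction: x^T (B B^T) x = ||B^T x||^2 >= 0 for any x.
x = rng.standard_normal(4)
quad_form = x @ (B @ B.T) @ x
norm_sq = np.linalg.norm(B.T @ x) ** 2

print(rebuilt_ok, min_eig, quad_form, norm_sq)
```

The factor $B$ is not unique; the Cholesky factor discussed in Section A.6.3 is another choice when $A$ is positive definite.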


A.6 Matrix Decompositions 


Matrix decompositions are frequently used in linear algebra to simplify proofs, avoid
numerical instability, and speed up computations. We mention three important matrix
decompositions: (P)LU, QR, and SVD.


A.6.1 (P)LU Decomposition 
Every invertible matrix $A$ can be written as the product of three matrices:

$$A = P L U, \tag{A.9}$$

where $L$ is a lower triangular matrix, $U$ an upper triangular matrix, and $P$ a permutation
matrix. A permutation matrix is a square matrix with a single 1 in each row and column,
and zeros otherwise. The matrix product $PB$ simply permutes the rows of a matrix $B$ and,
likewise, $BP$ permutes its columns. A decomposition of the form (A.9) is called a PLU
decomposition. As a permutation matrix is orthogonal, its transpose is equal to its inverse,
and so we can write (A.9) as

$$P^\top A = L U.$$

The decomposition is not unique, and in many cases $P$ can be taken to be the identity
matrix, in which case we speak of the LU decomposition of $A$, also called the LR for
left-right (triangular) decomposition.

A PLU decomposition of an invertible $n \times n$ matrix $A_0$ can be obtained recursively as
follows. The first step is to swap the rows of $A_0$ such that the element in the first column and
first row of the pivoted matrix is as large as possible in absolute value. Write the resulting
matrix as

$$\widetilde{P}_0 A_0 = \begin{bmatrix} a_1 & b_1^\top \\ c_1 & D_1 \end{bmatrix},$$

where $\widetilde{P}_0$ is the permutation matrix that swaps the first and $k$-th row, where $k$ is the row
that contains the largest element (in absolute value) in the first column. Next, add the matrix
$-c_1 [1, b_1^\top / a_1]$ to the last $n - 1$ rows of $\widetilde{P}_0 A_0$, to obtain the matrix

$$\begin{bmatrix} a_1 & b_1^\top \\ 0 & D_1 - c_1 b_1^\top / a_1 \end{bmatrix} =: \begin{bmatrix} a_1 & b_1^\top \\ 0 & A_1 \end{bmatrix}.$$

In effect, we add some multiple of the first row to each of the remaining rows in order to
obtain zeros in the first column, except for the first element.

We now apply the same procedure to $A_1$ as we did to $A_0$ and then to subsequent smaller
matrices $A_2, \ldots, A_{n-1}$:

1. Swap the first row with the row having the maximal absolute value element in the
first column.

2. Make every other element in the first column equal to 0 by adding appropriate multiples
of the first row to the other rows.

Suppose that $A_t$ has a PLU decomposition $P_t L_t U_t$, and let $\widetilde{P}_{t-1}$ denote the row-swap
permutation applied to $A_{t-1}$ in Step 1. Then it is easy to check that

$$\underbrace{\widetilde{P}_{t-1}^\top \begin{bmatrix} 1 & 0^\top \\ 0 & P_t \end{bmatrix}}_{P_{t-1}} \; \underbrace{\begin{bmatrix} 1 & 0^\top \\ P_t^\top c_t / a_t & L_t \end{bmatrix}}_{L_{t-1}} \; \underbrace{\begin{bmatrix} a_t & b_t^\top \\ 0 & U_t \end{bmatrix}}_{U_{t-1}} \tag{A.10}$$

is a PLU decomposition of $A_{t-1}$. Since the PLU decomposition for the scalar $A_{n-1}$ is trivial,
by working backwards we obtain a PLU decomposition $P_0 L_0 U_0$ of $A_0$.


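Example A.6 (PLU Decomposition) Take

$$A = \begin{bmatrix} 0 & -1 & 7 \\ 3 & 2 & 0 \\ 1 & 1 & 1 \end{bmatrix}.$$

Our goal is to modify $A$ via Steps 1 and 2 above so as to obtain an upper triangular matrix
with maximal elements on the diagonal. We first swap the first and second row. Next, we
add $-1/3$ times the first row to the third row and $1/3$ times the second row to the third row:

$$\begin{bmatrix} 0 & -1 & 7 \\ 3 & 2 & 0 \\ 1 & 1 & 1 \end{bmatrix} \to \begin{bmatrix} 3 & 2 & 0 \\ 0 & -1 & 7 \\ 1 & 1 & 1 \end{bmatrix} \to \begin{bmatrix} 3 & 2 & 0 \\ 0 & -1 & 7 \\ 0 & 1/3 & 1 \end{bmatrix} \to \begin{bmatrix} 3 & 2 & 0 \\ 0 & -1 & 7 \\ 0 & 0 & 10/3 \end{bmatrix}.$$

The final matrix is $U_0$, and in the process we have applied the row-swap permutation matrices

$$\widetilde{P}_0 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \widetilde{P}_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Using the recursion (A.10) we can now recover $P_0$ and $L_0$. Namely, at the final iteration
we have $P_2 = 1$, $L_2 = 1$, and $U_2 = 10/3$. And subsequently,

$$P_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad L_1 = \begin{bmatrix} 1 & 0 \\ -1/3 & 1 \end{bmatrix}, \quad P_0 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad L_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1/3 & -1/3 & 1 \end{bmatrix},$$

observing that $a_1 = 3$, $c_1 = [0, 1]^\top$, $a_2 = -1$, and $c_2 = 1/3$. ■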
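The two steps of the procedure can be sketched in Python as follows. Note that this is an iterative (loop-based) variant rather than the recursive formulation (A.10), and that the function name `plu` and the bookkeeping list `perm` are our own; for the matrix of Example A.6 it yields the same factors as the recursion.

```python
from fractions import Fraction


def plu(rows):
    """Return (P, L, U) with A = P L U, using row pivoting and exact arithmetic."""
    n = len(rows)
    U = [[Fraction(x) for x in row] for row in rows]
    L = [[Fraction(0)] * n for _ in range(n)]
    perm = list(range(n))   # perm[i] = index of the original row now in position i
    for k in range(n):
        # Step 1: bring the entry of largest absolute value in column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[p][k] == 0:
            raise ValueError("matrix is singular")
        U[k], U[p] = U[p], U[k]
        L[k], L[p] = L[p], L[k]          # keep earlier multipliers aligned with their rows
        perm[k], perm[p] = perm[p], perm[k]
        L[k][k] = Fraction(1)
        # Step 2: create zeros below the diagonal and store the multipliers in L.
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            U[i] = [x - m * y for x, y in zip(U[i], U[k])]
    # Row i of P^T A is row perm[i] of A, hence P[perm[i]][i] = 1.
    P = [[Fraction(int(perm[j] == i)) for j in range(n)] for i in range(n)]
    return P, L, U


P, L, U = plu([[0, -1, 7], [3, 2, 0], [1, 1, 1]])
print(P)
print(L)
print(U)
```

Exact fractions are used so that the output can be compared entry by entry with the example; a floating-point version would follow the same steps.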


PLU decompositions can be used to solve large systems of linear equations of the form
$Ax = b$ efficiently, especially when such an equation has to be solved for many different $b$.
This is done by first decomposing $A$ into $PLU$, and then solving two triangular systems:

1. $Ly = P^\top b$.

2. $Ux = y$.

The first equation can be solved efficiently via forward substitution, and the second via
backward substitution, as illustrated in the following example.


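Example A.7 (Solving Linear Equations with an LU Decomposition) Let $A = PLU$
be the same as in Example A.6. We wish to solve $Ax = [1, 2, 3]^\top$. First, solving

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1/3 & -1/3 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$$

gives $y_1 = 2$, $y_2 = 1$ and $y_3 = 3 - 2/3 + 1/3 = 8/3$, by forward substitution. Next,

$$\begin{bmatrix} 3 & 2 & 0 \\ 0 & -1 & 7 \\ 0 & 0 & 10/3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 8/3 \end{bmatrix}$$

gives $x_3 = 4/5$, $x_2 = -1 + 28/5 = 23/5$, and $x_1 = 2(1 - 23/5)/3 = -12/5$, so
$x = [-12, 23, 4]^\top / 5$. ■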
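Forward and backward substitution are short enough to write out in full. The sketch below (with our own function names) reproduces the numbers of Example A.7 in exact arithmetic, taking the factors $L$ and $U$ from Example A.6 as given.

```python
from fractions import Fraction as F


def forward_substitution(L, b):
    """Solve L y = b for a lower triangular matrix L, from the first row down."""
    y = []
    for i, row in enumerate(L):
        s = sum(row[j] * y[j] for j in range(i))
        y.append((b[i] - s) / row[i])
    return y


def backward_substitution(U, y):
    """Solve U x = y for an upper triangular matrix U, from the last row up."""
    n = len(U)
    x = [None] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# Factors from Example A.6; P swaps the first two rows, so P^T b = [2, 1, 3].
L = [[F(1), F(0), F(0)], [F(0), F(1), F(0)], [F(1, 3), F(-1, 3), F(1)]]
U = [[F(3), F(2), F(0)], [F(0), F(-1), F(7)], [F(0), F(0), F(10, 3)]]
Pt_b = [F(2), F(1), F(3)]

y = forward_substitution(L, Pt_b)
x = backward_substitution(U, y)
print(y)   # 2, 1, 8/3
print(x)   # -12/5, 23/5, 4/5
```

Each substitution costs $O(n^2)$ operations, which is what makes reusing one factorization for many right-hand sides attractive.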





A.6.2 Woodbury Identity 


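LU (or more generally PLU) decompositions can also be applied to block matrices. A
starting point is the following LU decomposition for a general $2 \times 2$ matrix:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} a & 0 \\ c & d - bc/a \end{bmatrix} \begin{bmatrix} 1 & b/a \\ 0 & 1 \end{bmatrix},$$

which holds as long as $a \neq 0$; this can be seen by simply writing out the matrix product.
The block matrix generalization for matrices $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times k}$, $C \in \mathbb{R}^{k \times n}$, $D \in \mathbb{R}^{k \times k}$ is

$$\Sigma := \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} A & O_{n \times k} \\ C & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I_n & A^{-1} B \\ O_{k \times n} & I_k \end{bmatrix}, \tag{A.11}$$

provided that $A$ is invertible (again, write out the block matrix product). Here, we use the
notation $O_{p \times q}$ to denote the $p \times q$ matrix of zeros. We can further rewrite this as:

$$\Sigma = \begin{bmatrix} I_n & O_{n \times k} \\ C A^{-1} & I_k \end{bmatrix} \begin{bmatrix} A & O_{n \times k} \\ O_{k \times n} & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I_n & A^{-1} B \\ O_{k \times n} & I_k \end{bmatrix}.$$

Thus, inverting both sides, we obtain

$$\Sigma^{-1} = \begin{bmatrix} I_n & A^{-1} B \\ O_{k \times n} & I_k \end{bmatrix}^{-1} \begin{bmatrix} A & O_{n \times k} \\ O_{k \times n} & D - C A^{-1} B \end{bmatrix}^{-1} \begin{bmatrix} I_n & O_{n \times k} \\ C A^{-1} & I_k \end{bmatrix}^{-1}.$$

Inversion of the above block matrices gives (again write out)

$$\Sigma^{-1} = \begin{bmatrix} I_n & -A^{-1} B \\ O_{k \times n} & I_k \end{bmatrix} \begin{bmatrix} A^{-1} & O_{n \times k} \\ O_{k \times n} & (D - C A^{-1} B)^{-1} \end{bmatrix} \begin{bmatrix} I_n & O_{n \times k} \\ -C A^{-1} & I_k \end{bmatrix}. \tag{A.12}$$

Assuming that $D$ is invertible, we could also perform a block UL (as opposed to LU)
decomposition:

$$\Sigma = \begin{bmatrix} A - B D^{-1} C & B \\ O_{k \times n} & D \end{bmatrix} \begin{bmatrix} I_n & O_{n \times k} \\ D^{-1} C & I_k \end{bmatrix}, \tag{A.13}$$

which, after a similar calculation as the one above, yields

$$\Sigma^{-1} = \begin{bmatrix} I_n & O_{n \times k} \\ -D^{-1} C & I_k \end{bmatrix} \begin{bmatrix} (A - B D^{-1} C)^{-1} & O_{n \times k} \\ O_{k \times n} & D^{-1} \end{bmatrix} \begin{bmatrix} I_n & -B D^{-1} \\ O_{k \times n} & I_k \end{bmatrix}. \tag{A.14}$$

The upper-left block of $\Sigma^{-1}$ from (A.14) must be the same as the upper-left block of $\Sigma^{-1}$
from (A.12), leading to the Woodbury identity:

$$(A - B D^{-1} C)^{-1} = A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1}. \tag{A.15}$$

From (A.11) and the fact that the determinant of a product is the product of the determinants,
we see that $\det(\Sigma) = \det(A) \det(D - C A^{-1} B)$. Similarly, from (A.13) we have
$\det(\Sigma) = \det(A - B D^{-1} C) \det(D)$, leading to the identity

$$\det(A - B D^{-1} C) \det(D) = \det(A) \det(D - C A^{-1} B). \tag{A.16}$$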
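Both identities are easy to sanity-check numerically. In the sketch below the block sizes, the random blocks, and the scaling factor 0.1 (chosen by us so that all required inverses exist comfortably) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, n))     # easily invertible n x n block
D = np.eye(k)
B = 0.1 * rng.standard_normal((n, k))     # small off-diagonal blocks
C = 0.1 * rng.standard_normal((k, n))

A_inv, D_inv = np.linalg.inv(A), np.linalg.inv(D)

# Woodbury identity (A.15).
lhs = np.linalg.inv(A - B @ D_inv @ C)
rhs = A_inv + A_inv @ B @ np.linalg.inv(D - C @ A_inv @ B) @ C @ A_inv
woodbury_ok = np.allclose(lhs, rhs)

# Determinant identity (A.16).
det_lhs = np.linalg.det(A - B @ D_inv @ C) * np.linalg.det(D)
det_rhs = np.linalg.det(A) * np.linalg.det(D - C @ A_inv @ B)
det_ok = np.isclose(det_lhs, det_rhs)

print(woodbury_ok, det_ok)
```

The practical appeal is that the right-hand side of (A.15) only requires inverting a $k \times k$ matrix once $A^{-1}$ is known, which is cheap when $k$ is much smaller than $n$.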


The following special cases of (A.16) and (A.15) are of particular importance. 


Theorem A.10: Sherman—Morrison Formula 





Proof: Take $B = x$, $C = -y^\top$, and $D = 1$ in (A.16) and (A.15). □


One important application of the Sherman–Morrison formula is in the efficient solution
of the linear system $Ax = b$, where $A$ is an $n \times n$ matrix of the form:

$$A = A_0 + \sum_{j=1}^{p} a_j a_j^\top$$

for some column vectors $a_1, \ldots, a_p \in \mathbb{R}^n$ and $n \times n$ diagonal (or otherwise easily invertible)
matrix $A_0$. Such linear systems arise, for example, in the context of ridge regression and
optimization.

To see how the Sherman–Morrison formula can be exploited, define the matrices
$A_0, \ldots, A_p$ via the recursion:

$$A_k = A_{k-1} + a_k a_k^\top, \quad k = 1, \ldots, p.$$

Application of Theorem A.10 for $k = 1, \ldots, p$ yields the identities:³

$$A_k^{-1} = A_{k-1}^{-1} - \frac{A_{k-1}^{-1} a_k a_k^\top A_{k-1}^{-1}}{1 + a_k^\top A_{k-1}^{-1} a_k}, \qquad |A_k| = |A_{k-1}| \times (1 + a_k^\top A_{k-1}^{-1} a_k).$$

³ Here $|A|$ is a shorthand notation for $\det(A)$.

Therefore, by evolving the recursive relationships up until $k = p$, we obtain:

$$A_p^{-1} = A_0^{-1} - \sum_{j=1}^{p} \frac{A_{j-1}^{-1} a_j a_j^\top A_{j-1}^{-1}}{1 + a_j^\top A_{j-1}^{-1} a_j}, \qquad |A_p| = |A_0| \times \prod_{j=1}^{p} (1 + a_j^\top A_{j-1}^{-1} a_j).$$

These expressions will allow us to easily compute $A^{-1} = A_p^{-1}$ and $|A| = |A_p|$ provided the
following quantities are available:

$$c_{k,j} := A_{k-1}^{-1} a_j, \quad k = 1, \ldots, p, \quad j = k, \ldots, p.$$

Since, by Theorem A.10, we can write:

$$A_{k-1}^{-1} a_j = A_{k-2}^{-1} a_j - \frac{A_{k-2}^{-1} a_{k-1} a_{k-1}^\top A_{k-2}^{-1}}{1 + a_{k-1}^\top A_{k-2}^{-1} a_{k-1}} \, a_j,$$

the quantities $\{c_{k,j}\}$ can be computed from the recursion:

$$c_{1,j} = A_0^{-1} a_j, \quad j = 1, \ldots, p,$$

$$c_{k,j} = c_{k-1,j} - \frac{a_{k-1}^\top c_{k-1,j}}{1 + a_{k-1}^\top c_{k-1,k-1}} \, c_{k-1,k-1}, \quad k = 2, \ldots, p, \quad j = k, \ldots, p. \tag{A.17}$$

Observe that this recursive computation takes $O(p^2 n)$ time and that once $\{c_{k,j}\}$ are available,
we can express $A^{-1}$ and $|A|$ as (using the symmetry of $A_{j-1}$, which holds when $A_0$ is symmetric):

$$A^{-1} = A_0^{-1} - \sum_{j=1}^{p} \frac{c_{j,j} \, c_{j,j}^\top}{1 + a_j^\top c_{j,j}}, \qquad |A| = |A_0| \times \prod_{j=1}^{p} (1 + a_j^\top c_{j,j}).$$


In summary, we have proved the following. 


Theorem A.11: Sherman—Morrison Recursion 
As a consequence of Theorem A.11, the solution to the linear system $Ax = b$ can be
computed in $O(p^2 n)$ time via:

$$x = A_0^{-1} b - C D^{-1} [C^\top b].$$

If $n > p$, the Sherman–Morrison recursion can frequently be much faster than the $O(n^3)$
direct solution via the LU decomposition method in Section A.6.1.

In summary, the following algorithm computes the matrices $C$ and $D$ in Theorem A.11
via the recursion (A.17).


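Algorithm A.6.1: Sherman–Morrison Recursion
input: Easily invertible matrix $A_0$ and column vectors $a_1, \ldots, a_p$.
output: Matrices $C$ and $D$ such that $C D^{-1} C^\top = A_0^{-1} - (A_0 + \sum_j a_j a_j^\top)^{-1}$.
$c_k \leftarrow A_0^{-1} a_k$ for $k = 1, \ldots, p$ (assuming $A_0$ is a diagonal or easily invertible matrix)
for $k = 1, \ldots, p - 1$ do
    $d_k \leftarrow 1 + a_k^\top c_k$
    for $j = k + 1, \ldots, p$ do
        $c_j \leftarrow c_j - \dfrac{a_k^\top c_j}{d_k} \, c_k$
$d_p \leftarrow 1 + a_p^\top c_p$
$C \leftarrow [c_1, \ldots, c_p]$
$D \leftarrow \mathrm{diag}(d_1, \ldots, d_p)$
return $C$ and $D$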


Finally, note that if $A_0$ is a diagonal matrix and we only store the diagonal elements of
$D$ and $A_0$ (as opposed to storing the full matrices $D$ and $A_0$), then the storage or memory
requirements of Algorithm A.6.1 are only $O(pn)$.
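A plain-Python sketch of Algorithm A.6.1 for the case of a diagonal $A_0$ might look as follows. The function names, the list-based storage (only the diagonal of $A_0$ and of $D$ is kept), and the small test system are our own assumptions.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def sherman_morrison_recursion(a0_diag, vectors):
    """Return the columns c_1..c_p of C and the diagonal d_1..d_p of D.

    a0_diag holds the diagonal of A_0; vectors is the list [a_1, ..., a_p].
    """
    p = len(vectors)
    c = [[ai / di for ai, di in zip(a, a0_diag)] for a in vectors]   # c_k = A_0^{-1} a_k
    d = [0.0] * p
    for k in range(p):
        d[k] = 1.0 + dot(vectors[k], c[k])
        for j in range(k + 1, p):          # empty for the last k, which only sets d_p
            coef = dot(vectors[k], c[j]) / d[k]
            c[j] = [cj - coef * ck for cj, ck in zip(c[j], c[k])]
    return c, d


def solve(a0_diag, vectors, b):
    """Solve (A_0 + sum_j a_j a_j^T) x = b via x = A_0^{-1} b - C D^{-1} C^T b."""
    c, d = sherman_morrison_recursion(a0_diag, vectors)
    x = [bi / di for bi, di in zip(b, a0_diag)]
    for ck, dk in zip(c, d):
        w = dot(ck, b) / dk
        x = [xi - w * ci for xi, ci in zip(x, ck)]
    return x


a0_diag = [2.0, 3.0, 4.0, 5.0]
vectors = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = solve(a0_diag, vectors, b)

# Residual check: A x = A_0 x + sum_j a_j (a_j^T x) should reproduce b.
Ax = [a0_diag[i] * x[i] + sum(a[i] * dot(a, x) for a in vectors) for i in range(4)]
print(x)
print(Ax)
```

The full matrix $A$ is never formed, in line with the $O(pn)$ storage remark above.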


A.6.3 Cholesky Decomposition 


If $A$ is a real-valued positive definite matrix (and therefore symmetric), e.g., a covariance
matrix, then an LU decomposition can be achieved with matrices $L$ and $U = L^\top$.


Theorem A.12: Cholesky Decomposition 










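Proof: The proof is by inductive construction. For $k = 1, \ldots, n$, let $A_k$ be the left-upper
$k \times k$ submatrix of $A = A_n$. With $e_1 := [1, 0, \ldots, 0]^\top$, we have $A_1 = a_{11} = e_1^\top A e_1 > 0$ by the
positive-definiteness of $A$. It follows that $L_1 = \sqrt{a_{11}}$. Suppose that $A_{k-1}$ has a Cholesky
factorization $L_{k-1} L_{k-1}^\top$ with $L_{k-1}$ having strictly positive diagonal elements. We can construct
a Cholesky factorization of $A_k$ as follows. First write

$$A_k = \begin{bmatrix} L_{k-1} L_{k-1}^\top & a_{k-1} \\ a_{k-1}^\top & a_{kk} \end{bmatrix}$$

and propose $L_k$ to be of the form

$$L_k = \begin{bmatrix} L_{k-1} & 0 \\ l_{k-1}^\top & l_{kk} \end{bmatrix}$$

for some vector $l_{k-1} \in \mathbb{R}^{k-1}$ and scalar $l_{kk}$, for which it must hold that

$$\begin{bmatrix} L_{k-1} L_{k-1}^\top & a_{k-1} \\ a_{k-1}^\top & a_{kk} \end{bmatrix} = \begin{bmatrix} L_{k-1} & 0 \\ l_{k-1}^\top & l_{kk} \end{bmatrix} \begin{bmatrix} L_{k-1}^\top & l_{k-1} \\ 0^\top & l_{kk} \end{bmatrix}.$$

To establish that such an $l_{k-1}$ and $l_{kk}$ exist, we must verify that the set of equations

$$L_{k-1} l_{k-1} = a_{k-1}, \qquad l_{k-1}^\top l_{k-1} + l_{kk}^2 = a_{kk} \tag{A.19}$$

has a solution. The system $L_{k-1} l_{k-1} = a_{k-1}$ has a unique solution, because (by assumption)
$L_{k-1}$ is lower triangular with strictly positive entries down the main diagonal and we
can solve for $l_{k-1}$ using forward substitution: $l_{k-1} = L_{k-1}^{-1} a_{k-1}$. We can solve the second
equation as $l_{kk} = \sqrt{a_{kk} - \|l_{k-1}\|^2}$, provided that the term within the square root is positive.
We demonstrate this using the fact that $A$ is a positive definite matrix. In particular, for
$x \in \mathbb{R}^n$ of the form $[x_1^\top, x_2, 0^\top]^\top$, where $x_1$ is a non-zero $(k-1)$-dimensional vector and $x_2$
a non-zero number, we have

$$0 < x^\top A x = [x_1^\top, x_2] \begin{bmatrix} L_{k-1} L_{k-1}^\top & a_{k-1} \\ a_{k-1}^\top & a_{kk} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \|L_{k-1}^\top x_1\|^2 + 2 x_1^\top a_{k-1} x_2 + a_{kk} x_2^2.$$

Now take $x_1 = -x_2 L_{k-1}^{-\top} l_{k-1}$ to obtain $0 < x^\top A x = x_2^2 (a_{kk} - \|l_{k-1}\|^2)$. Therefore, (A.19)
can be uniquely solved. As we have already solved it for $k = 1$, we can solve it for any
$k = 1, \ldots, n$, leading to the recursive formula (A.18) and Algorithm A.6.2 below. □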


An implementation of Cholesky's decomposition that uses the notation in the proof of
Theorem A.12 is the following algorithm, whose running cost is $O(n^3)$.


Algorithm A.6.2: Cholesky Decomposition
input: Positive-definite $n \times n$ matrix $A_n$ with entries $\{a_{ij}\}$.
output: Lower triangular $L_n$ such that $L_n L_n^\top = A_n$.
$L_1 \leftarrow \sqrt{a_{11}}$
for $k = 2, \ldots, n$ do
    $a_{k-1} \leftarrow [a_{1k}, \ldots, a_{k-1,k}]^\top$
    $l_{k-1} \leftarrow L_{k-1}^{-1} a_{k-1}$ (computed in $O(k^2)$ time via forward substitution)
    $l_{kk} \leftarrow \sqrt{a_{kk} - l_{k-1}^\top l_{k-1}}$
    $L_k \leftarrow \begin{bmatrix} L_{k-1} & 0 \\ l_{k-1}^\top & l_{kk} \end{bmatrix}$
return $L_n$
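Algorithm A.6.2 translates almost line by line into Python. In the sketch below (function name and test matrix are our own), row $k$ of the output first receives $l_{k-1}$ via forward substitution and then the diagonal element $l_{kk}$.

```python
import math


def cholesky(A):
    """Return lower triangular L (as a list of rows) with L L^T = A."""
    n = len(A)
    if A[0][0] <= 0:
        raise ValueError("matrix is not positive definite")
    L = [[0.0] * n for _ in range(n)]
    L[0][0] = math.sqrt(A[0][0])
    for k in range(1, n):
        # Forward substitution for L_{k-1} l = a, with a the first k entries of column k.
        for i in range(k):
            s = sum(L[i][j] * L[k][j] for j in range(i))
            L[k][i] = (A[i][k] - s) / L[i][i]
        rest = A[k][k] - sum(L[k][j] ** 2 for j in range(k))
        if rest <= 0:
            raise ValueError("matrix is not positive definite")
        L[k][k] = math.sqrt(rest)
    return L


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky(A)
print(L)   # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

The positivity check on the term under the square root doubles as a simple test for positive definiteness.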
A.6.4 QR Decomposition and the Gram–Schmidt Procedure


Let $A$ be an $n \times p$ matrix, where $p \le n$. Then, there exists a matrix $Q \in \mathbb{R}^{n \times p}$ satisfying
$Q^\top Q = I_p$, and an upper triangular matrix $R \in \mathbb{R}^{p \times p}$, such that

$$A = QR.$$

This is the QR decomposition for real-valued matrices. When $A$ has full column rank, such
a decomposition can be obtained via the Gram–Schmidt procedure, which constructs an
orthonormal basis $\{u_1, \ldots, u_p\}$ of the column space of $A$ spanned by $\{a_1, \ldots, a_p\}$, in the
following way (see also Figure A.5):

1. Take $u_1 = a_1 / \|a_1\|$.

2. Let $p_1$ be the projection of $a_2$ onto $\mathrm{Span}\{u_1\}$. That is, $p_1 = \langle u_1, a_2 \rangle u_1$. Now take
$u_2 = (a_2 - p_1) / \|a_2 - p_1\|$. This vector is perpendicular to $u_1$ and has unit length.

3. Let $p_2$ be the projection of $a_3$ onto $\mathrm{Span}\{u_1, u_2\}$. That is, $p_2 = \langle u_1, a_3 \rangle u_1 + \langle u_2, a_3 \rangle u_2$.
Now take $u_3 = (a_3 - p_2) / \|a_3 - p_2\|$. This vector is perpendicular to both $u_1$ and $u_2$
and has unit length.

4. Continue this process to obtain $u_4, \ldots, u_p$.













Figure A.5: Illustration of the Gram-Schmidt procedure. 


At the end of the procedure, a set $\{u_1, \ldots, u_p\}$ of $p$ orthonormal vectors is obtained.
Consequently, as a result of Theorem A.3,

$$a_j = \sum_{i=1}^{j} \underbrace{\langle a_j, u_i \rangle}_{r_{ij}} \, u_i, \quad j = 1, \ldots, p,$$

for some numbers $r_{ij}$, $i = 1, \ldots, j$, $j = 1, \ldots, p$. Denoting the corresponding upper triangular
matrix $[r_{ij}]$ by $R$, we have in matrix notation:

$$QR = [u_1, \ldots, u_p] \begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1p} \\ 0 & r_{22} & r_{23} & \cdots & r_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & r_{pp} \end{bmatrix} = [a_1, \ldots, a_p] = A,$$

which yields a QR decomposition. The QR decomposition can be used to efficiently solve
least-squares problems; this will be shown shortly. It can also be used to calculate the
determinant of the matrix $A$, whenever $A$ is square. Namely, $\det(A) = \det(Q) \det(R) =
\pm \det(R)$, because $\det(Q) = \pm 1$ for an orthogonal matrix $Q$; and since $R$ is triangular, its
determinant is the product of its diagonal elements.
There exist various improvements of the Gram–Schmidt process (for example, the Householder
transformation [52]) that not only improve the numerical stability of the QR decomposition,
but also can be applied even when $A$ is not full rank.
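The Gram–Schmidt steps can be sketched directly in Python. This is the classical (numerically less stable) variant described in the text, not a Householder-based routine; the function name and the small test vectors are our own.

```python
import math


def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_schmidt_qr(columns):
    """Classical Gram-Schmidt for the list of column vectors [a_1, ..., a_p].

    Returns (us, R): the orthonormal vectors u_1..u_p and the p x p upper
    triangular matrix R with a_j = sum over i <= j of R[i][j] * u_i.
    """
    p = len(columns)
    us = []
    R = [[0.0] * p for _ in range(p)]
    for j, a in enumerate(columns):
        residual = list(a)
        for i, u in enumerate(us):
            R[i][j] = dot(a, u)   # r_ij = <a_j, u_i>
            residual = [r - R[i][j] * ui for r, ui in zip(residual, u)]
        norm = math.sqrt(dot(residual, residual))
        if norm == 0.0:
            raise ValueError("columns are linearly dependent")
        R[j][j] = norm            # equals <a_j, u_j>
        us.append([r / norm for r in residual])
    return us, R


cols = [[3.0, 0.0, 4.0], [1.0, 2.0, 1.0]]
us, R = gram_schmidt_qr(cols)
print(us)
print(R)
```

The vectors in `us` are the columns of $Q$; multiplying them by the entries of `R` as in the formula above recovers the original columns.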
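An important application of the QR decomposition is found in solving the least-squares
problem in $O(p^2 n)$ time:

$$\min_{\boldsymbol{\beta} \in \mathbb{R}^p} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$$

for some $\mathbf{X} \in \mathbb{R}^{n \times p}$ (model) matrix. Using the defining properties of the pseudo-inverse in
Definition A.2, one can show that $\|\mathbf{X}\mathbf{X}^+\mathbf{y} - \mathbf{y}\|^2 \leq \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$ for any $\boldsymbol{\beta}$. In other words, $\widehat{\boldsymbol{\beta}} :=
\mathbf{X}^+\mathbf{y}$ minimizes $\|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|$. If we have the QR decomposition $\mathbf{X} = \mathbf{Q}\mathbf{R}$, then a numerically
stable way to calculate $\widehat{\boldsymbol{\beta}}$ with an $O(p^2 n)$ cost is via

$$\widehat{\boldsymbol{\beta}} = (\mathbf{Q}\mathbf{R})^+\mathbf{y} = \mathbf{R}^+\mathbf{Q}^+\mathbf{y} = \mathbf{R}^+\mathbf{Q}^\top\mathbf{y}.$$

If $\mathbf{X}$ has full column rank, then $\mathbf{R}^+ = \mathbf{R}^{-1}$.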
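As a small numerical illustration of the formula above, the following sketch solves a least-squares problem through the reduced QR decomposition and compares the result with the pseudo-inverse solution. The sizes, the random data and the variable names are assumptions made for the example; the matrix is taken to have full column rank so that $\mathbf{R}^+ = \mathbf{R}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4                        # hypothetical problem size
X = rng.standard_normal((n, p))     # full column rank with probability 1
y = rng.standard_normal(n)

Q, R = np.linalg.qr(X)              # reduced QR: Q is n x p, R is p x p upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)   # solves R beta = Q^T y
beta_pinv = np.linalg.pinv(X) @ y       # reference solution X^+ y
```

The general-purpose `np.linalg.solve` is used here only for brevity; a dedicated triangular solver (back substitution) would exploit the structure of $\mathbf{R}$.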

Note that while the QR decomposition is the method of choice for solving the ordinary 
least-squares regression problem, the Sherman—Morrison recursion is the method of choice 
for solving the regularized least-squares (or ridge) regression problem. 


A.6.5 Singular Value Decomposition 


One of the most useful matrix decompositions is the singular value decomposition (SVD). 


Theorem A.13: Singular Value Decomposition 





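Proof: Without loss of generality we can assume that $m \geq n$ (otherwise consider the transpose
of $\mathbf{A}$). Then $\mathbf{A}^*\mathbf{A}$ is a positive semidefinite Hermitian matrix, because $\langle \mathbf{A}^*\mathbf{A}\mathbf{v}, \mathbf{v}\rangle =
\mathbf{v}^*\mathbf{A}^*\mathbf{A}\mathbf{v} = \|\mathbf{A}\mathbf{v}\|^2 \geq 0$ for all $\mathbf{v}$. Hence, $\mathbf{A}^*\mathbf{A}$ has non-negative real eigenvalues, $\lambda_1 \geq \lambda_2 \geq
\cdots \geq \lambda_n \geq 0$. By Theorem A.7 the matrix $\mathbf{V} = [\mathbf{v}_1,\ldots,\mathbf{v}_n]$ of right-eigenvectors is a unitary
matrix. Define the $i$-th singular value as $\sigma_i = \sqrt{\lambda_i}$, $i = 1,\ldots,n$ and suppose $\lambda_1,\ldots,\lambda_r$
are all greater than 0, and $\lambda_{r+1},\ldots,\lambda_n = 0$. In particular, $\mathbf{A}\mathbf{v}_i = \mathbf{0}$ for $i = r+1,\ldots,n$. Let
$\mathbf{u}_i = \mathbf{A}\mathbf{v}_i/\sigma_i$, $i = 1,\ldots,r$. Then, for $i, j \leq r$,

$$\langle \mathbf{u}_i, \mathbf{u}_j\rangle = \mathbf{u}_j^*\mathbf{u}_i = \frac{\mathbf{v}_j^*\mathbf{A}^*\mathbf{A}\mathbf{v}_i}{\sigma_i\sigma_j} = \frac{\lambda_i\, \mathbb{1}\{i = j\}}{\sigma_i\sigma_j} = \mathbb{1}\{i = j\}.$$

We can extend $\mathbf{u}_1,\ldots,\mathbf{u}_r$ to an orthonormal basis $\{\mathbf{u}_1,\ldots,\mathbf{u}_m\}$ of $\mathbb{C}^m$ (e.g., using the
Gram-Schmidt procedure). Let $\mathbf{U} = [\mathbf{u}_1,\ldots,\mathbf{u}_m]$ be the corresponding unitary matrix. Defining $\boldsymbol{\Sigma}$
to be the $m \times n$ diagonal matrix with diagonal $(\sigma_1,\ldots,\sigma_r,0,\ldots,0)$, we have,

$$\mathbf{U}\boldsymbol{\Sigma} = [\mathbf{A}\mathbf{v}_1,\ldots,\mathbf{A}\mathbf{v}_r,\mathbf{0},\ldots,\mathbf{0}] = \mathbf{A}\mathbf{V},$$

and hence $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$. □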
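Note that

$$\mathbf{A}\mathbf{A}^* = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*\mathbf{V}\boldsymbol{\Sigma}^\top\mathbf{U}^* = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^\top\mathbf{U}^* \quad\text{and}\quad \mathbf{A}^*\mathbf{A} = \mathbf{V}\boldsymbol{\Sigma}^\top\mathbf{U}^*\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* = \mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^*.$$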


So, U is a unitary matrix whose columns are eigenvectors of AA* and V is a unitary matrix 
whose columns are eigenvectors of A*A. 

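The SVD makes it possible to write the matrix $\mathbf{A}$ as a sum of rank-1 matrices, weighted
by the singular values $\{\sigma_i\}$:

$$\mathbf{A} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m]
\begin{bmatrix}
\sigma_1 & 0 & \cdots & \cdots & 0 \\
0 & \ddots & & & \vdots \\
\vdots & & \sigma_r & & \vdots \\
\vdots & & & 0 & \vdots \\
0 & \cdots & \cdots & \cdots & 0
\end{bmatrix}
\begin{bmatrix} \mathbf{v}_1^* \\ \mathbf{v}_2^* \\ \vdots \\ \mathbf{v}_n^* \end{bmatrix}
= \sum_{i=1}^{r} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^*, \tag{A.20}$$

which is called the dyade or spectral representation of $\mathbf{A}$.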


For real-valued matrices, the SVD has a nice geometric interpretation, illustrated in
Figure A.6. The linear mapping defined by matrix $\mathbf{A}$ can be thought of as a succession of
three linear operations: (1) an orthogonal transformation (i.e., a rotation with a possible
flipping of some axes), corresponding to matrix $\mathbf{V}^\top$, followed by (2) a simple scaling of
the unit vectors, corresponding to $\boldsymbol{\Sigma}$, followed by (3) another orthogonal transformation,
corresponding to $\mathbf{U}$.




Figure A.6: The figure shows how the unit circle and unit vectors (first panel) are first
rotated (second panel), then scaled (third panel), and finally rotated and flipped.


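■ Example A.8 (Ellipses) We continue Example A.5. Using the svd method of the module
numpy.linalg, we obtain the following SVD matrices for matrix $\mathbf{A}$:

$$\mathbf{U} = \begin{bmatrix} -0.5430 & 0.8398 \\ 0.8398 & 0.5430 \end{bmatrix}, \quad
\boldsymbol{\Sigma} = \begin{bmatrix} 2.4221 & 0 \\ 0 & 0.6193 \end{bmatrix}, \quad\text{and}\quad
\mathbf{V} = \begin{bmatrix} -0.3975 & 0.9176 \\ -0.9176 & -0.3975 \end{bmatrix}.$$

Figure A.4 shows the columns of the matrix $\mathbf{U}\boldsymbol{\Sigma}$ as the two principal axes of the ellipse that
is obtained by applying matrix $\mathbf{A}$ to the points of the unit circle. ■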
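The computation in this example can be sketched as follows. The matrix of Example A.5 is not restated in this section, so the sketch assumes $\mathbf{A} = [[1, 1], [-1/2, -2]]$, which is what the product $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ of the printed factors gives up to rounding; treat that matrix as an assumption. The sketch also rebuilds $\mathbf{A}$ from the rank-1 terms of the spectral representation (A.20).

```python
import numpy as np

# Assumed matrix: reconstructed from the printed SVD factors, not quoted from Example A.5.
A = np.array([[1.0, 1.0],
              [-0.5, -2.0]])

U, s, Vt = np.linalg.svd(A)      # numpy returns V^T (here called Vt), not V

# Sum of rank-1 matrices sigma_i * u_i * v_i^T, as in (A.20)
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
```

The signs of the columns of `U` and rows of `Vt` are not unique, so a different library version may return factors that differ from the printed ones by sign flips, while the singular values and the reconstructed matrix are unaffected.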


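A practical method to compute the pseudo-inverse of a real-valued matrix $\mathbf{A}$ is via the
singular value decomposition $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, where $\boldsymbol{\Sigma}$ is the diagonal matrix collecting all the
positive singular values, say $\sigma_1,\ldots,\sigma_r$, as in Theorem A.13. In this case, $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^\top$,
where $\boldsymbol{\Sigma}^+$ is the $n \times m$ diagonal (pseudo-inverse) matrix:

$$\boldsymbol{\Sigma}^+ = \begin{bmatrix}
\sigma_1^{-1} & 0 & \cdots & \cdots & 0 \\
0 & \ddots & & & \vdots \\
\vdots & & \sigma_r^{-1} & & \vdots \\
\vdots & & & 0 & \vdots \\
0 & \cdots & \cdots & \cdots & 0
\end{bmatrix}.$$

We conclude with a typical application of the pseudo-inverse for a least-squares optimization
problem from data science.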


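■ Example A.9 (Rank-Deficient Least Squares) Given is an $n \times p$ data matrix

$$\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}.$$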


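It is assumed that the matrix is of full row rank (all rows of $\mathbf{X}$ are linearly independent) and
that the number of rows is less than the number of columns: $n < p$. Under this setting, any
solution to the equation $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ provides a perfect fit to the data and minimizes (to 0) the
least-squares problem

$$\widehat{\boldsymbol{\beta}} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta} \in \mathbb{R}^p} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2. \tag{A.21}$$

In particular, if $\boldsymbol{\beta}^*$ minimizes $\|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$ then so does $\boldsymbol{\beta}^* + \mathbf{u}$ for all $\mathbf{u}$ in the null space
$\mathcal{N}_{\mathbf{X}} := \{\mathbf{u} : \mathbf{X}\mathbf{u} = \mathbf{0}\}$, which has dimension $p - n$. To cope with the non-uniqueness of
solutions, a possible approach is to solve instead the following optimization problem:

$$\begin{aligned}
&\mathop{\mathrm{minimize}}_{\boldsymbol{\beta} \in \mathbb{R}^p} \;\; \boldsymbol{\beta}^\top\boldsymbol{\beta} \\
&\text{subject to} \;\; \mathbf{X}\boldsymbol{\beta} - \mathbf{y} = \mathbf{0}.
\end{aligned}$$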


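That is, we are interested in a solution $\boldsymbol{\beta}$ with the smallest squared norm (or, equivalently,
the smallest norm). The solution can be obtained via Lagrange's method (see Section B.2.2).
Specifically, set $\mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\lambda}) = \boldsymbol{\beta}^\top\boldsymbol{\beta} - \boldsymbol{\lambda}^\top(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})$, and solve

$$\nabla_{\boldsymbol{\beta}}\, \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\lambda}) = 2\boldsymbol{\beta} - \mathbf{X}^\top\boldsymbol{\lambda} = \mathbf{0}, \tag{A.22}$$

and

$$\nabla_{\boldsymbol{\lambda}}\, \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\lambda}) = \mathbf{X}\boldsymbol{\beta} - \mathbf{y} = \mathbf{0}. \tag{A.23}$$

From (A.22) we get $\boldsymbol{\beta} = \mathbf{X}^\top\boldsymbol{\lambda}/2$. By substituting it in (A.23), we arrive at $\boldsymbol{\lambda} = 2(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y}$,
and hence $\widehat{\boldsymbol{\beta}}$ is given by

$$\widehat{\boldsymbol{\beta}} = \frac{\mathbf{X}^\top\boldsymbol{\lambda}}{2} = \frac{\mathbf{X}^\top\, 2(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y}}{2} = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y} = \mathbf{X}^+\mathbf{y}.$$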


An example Python code is given below. 


svdexample.py 


from numpy import diag, zeros, vstack
from numpy.random import rand, seed
from numpy.linalg import svd, pinv
seed(12345)
n = 5
p = 8
X = rand(n,p)
y = rand(n,1)
U,S,VT = svd(X)
SI = diag(1/S)
# compute pseudo inverse
pseudo_inv = VT.T @ vstack((SI, zeros((p-n,n)))) @ U.T
b = pseudo_inv @ y
#b = pinv(X) @ y  #remove comment for the built-in pseudo inverse
print(X @ b - y)


.55111512e-16] 
.11022302e-16] 
.55111512e-16] 
.60422844e-16] 
.22044605e-16] ] 





A.6.6 Solving Structured Matrix Equations 


For a general matrix $\mathbf{A} \in \mathbb{C}^{n \times n}$, performing matrix-vector multiplications takes $O(n^2)$ operations;
and solving linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$, and carrying out LU decompositions takes
$O(n^3)$ operations. However, when $\mathbf{A}$ is sparse (i.e., has relatively few non-zero elements)
or has a special structure, the computational complexity for these operations can often be
reduced. Matrices $\mathbf{A}$ that are "structured" in this way often satisfy a Sylvester equation, of
the form

$$\mathbf{M}_1\mathbf{A} - \mathbf{A}\mathbf{M}_2 = \mathbf{G}_1\mathbf{G}_2^\top, \tag{A.24}$$

where $\mathbf{M}_i \in \mathbb{C}^{n \times n}$, $i = 1, 2$ are sparse matrices and $\mathbf{G}_i \in \mathbb{C}^{n \times r}$, $i = 1, 2$ are matrices of
rank $r \ll n$. The elements of $\mathbf{A}$ must be easy to recover from these matrices, e.g., with
$O(1)$ operations. A typical example is a (square) Toeplitz matrix, which has the following
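structure:

$$\mathbf{A} = \begin{bmatrix}
a_0 & a_{-1} & \cdots & a_{-(n-2)} & a_{-(n-1)} \\
a_1 & a_0 & a_{-1} & & a_{-(n-2)} \\
\vdots & a_1 & a_0 & \ddots & \vdots \\
a_{n-2} & & \ddots & \ddots & a_{-1} \\
a_{n-1} & a_{n-2} & \cdots & a_1 & a_0
\end{bmatrix}.$$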


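A general square Toeplitz matrix $\mathbf{A}$ is completely determined by the $2n - 1$ elements along
its first row and column. If $\mathbf{A}$ is also Hermitian (i.e., $\mathbf{A}^* = \mathbf{A}$), then clearly it is determined
by only $n$ elements. If we define the matrices:

$$\mathbf{M}_1 = \begin{bmatrix}
0 & 0 & \cdots & 0 & 1 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \ddots & \vdots & \vdots \\
\vdots & & \ddots & 0 & 0 \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\quad\text{and}\quad
\mathbf{M}_2 = \begin{bmatrix}
0 & 0 & \cdots & 0 & -1 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \ddots & \vdots & \vdots \\
\vdots & & \ddots & 0 & 0 \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix},$$

then (A.24) is satisfied with

$$\mathbf{G}_1\mathbf{G}_2^\top :=
\begin{bmatrix}
1 & 0 \\
0 & a_1 + a_{-(n-1)} \\
0 & a_2 + a_{-(n-2)} \\
\vdots & \vdots \\
0 & a_{n-1} + a_{-1}
\end{bmatrix}
\begin{bmatrix}
a_{n-1} - a_{-1} & a_{n-2} - a_{-2} & \cdots & a_1 - a_{-(n-1)} & 2a_0 \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
a_{n-1} - a_{-1} & a_{n-2} - a_{-2} & \cdots & a_1 - a_{-(n-1)} & 2a_0 \\
0 & 0 & \cdots & 0 & a_1 + a_{-(n-1)} \\
\vdots & \vdots & & \vdots & a_2 + a_{-(n-2)} \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{n-1} + a_{-1}
\end{bmatrix},$$

which has rank $r \leq 2$.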
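The low-rank property can be checked numerically. The sketch below builds a random Toeplitz matrix and forms $\mathbf{M}_1\mathbf{A} - \mathbf{A}\mathbf{M}_2$, taking $\mathbf{M}_1$ to be the cyclic down-shift matrix and $\mathbf{M}_2$ the same down-shift with $-1$ in the top-right corner. That orientation of the two shift matrices is an assumption of this sketch, chosen because it reproduces the stated first row and last column of $\mathbf{G}_1\mathbf{G}_2^\top$; the size and the random entries are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
a = rng.standard_normal(2 * n - 1)    # a[k + n - 1] stores a_k for k = -(n-1), ..., n-1
A = np.array([[a[i - j + n - 1] for j in range(n)] for i in range(n)])  # A[i, j] = a_{i-j}

M1 = np.roll(np.eye(n), 1, axis=0)    # ones on the subdiagonal and +1 in the top-right corner
M2 = M1.copy()
M2[0, n - 1] = -1.0                   # same shift, but -1 in the top-right corner

D = M1 @ A - A @ M2                   # displacement M1 A - A M2
rank_D = np.linalg.matrix_rank(D)
```

Only the first row and the last column of `D` are non-zero, so its rank is at most 2 regardless of $n$.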


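■ Example A.10 (Discrete Convolution of Vectors) The convolution of two vectors
can be represented as multiplication of one of the vectors by a Toeplitz matrix. Suppose
$\mathbf{a} = [a_1,\ldots,a_n]^\top$ and $\mathbf{b} = [b_1,\ldots,b_n]^\top$ are two complex-valued vectors. Then, their
convolution is defined as the vector $\mathbf{a} * \mathbf{b}$ with $i$-th element

$$[\mathbf{a} * \mathbf{b}]_i = \sum_{k=1}^{n} a_k\, b_{i-k+1}, \quad i = 1,\ldots,n,$$

where $b_j := 0$ for $j \leq 0$. It is easy to verify that the convolution can be written as

$$\mathbf{a} * \mathbf{b} = \mathbf{A}\mathbf{b},$$

where, denoting the $d$-dimensional column vector of zeros by $\mathbf{0}_d$, we have that

$$\mathbf{A} = \begin{bmatrix}
\mathbf{a} & 0 & \cdots & \mathbf{0}_{n-2} & \mathbf{0}_{n-1} \\
\mathbf{0}_{n-1} & \mathbf{a} & \cdots & \mathbf{a} & \mathbf{a} \\
 & \mathbf{0}_{n-2} & \cdots & 0 &
\end{bmatrix},$$

that is, the $j$-th column of $\mathbf{A}$ consists of $j - 1$ zeros, followed by $\mathbf{a}$, followed by $n - j$ zeros.
Clearly, the matrix $\mathbf{A}$ is a (sparse) Toeplitz matrix. ■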
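A small numerical check of this representation, under the assumption that the columns of the matrix are shifted copies of $\mathbf{a}$ padded with zeros (so that the matrix has $2n - 1$ rows): the first $n$ entries of the product are the elements given by the sum above, and the whole product agrees with NumPy's full discrete convolution. The example vectors are hypothetical.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])    # hypothetical example vectors
b = np.array([4.0, 5.0, 6.0])
n = len(a)

# (2n-1) x n matrix whose j-th column is a, shifted down by j positions (0-based)
A = np.zeros((2 * n - 1, n))
for j in range(n):
    A[j:j + n, j] = a

conv = A @ b                     # matrix form of the convolution
```

For these inputs the first three entries are $a_1 b_1 = 4$, $a_1 b_2 + a_2 b_1 = 13$ and $a_1 b_3 + a_2 b_2 + a_3 b_1 = 28$.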
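A circulant matrix is a special Toeplitz matrix which is obtained from a vector $\mathbf{c}$ by
circularly permuting its indices as follows:

$$\mathbf{C} = \begin{bmatrix}
c_0 & c_{n-1} & \cdots & c_2 & c_1 \\
c_1 & c_0 & c_{n-1} & & c_2 \\
\vdots & c_1 & c_0 & \ddots & \vdots \\
c_{n-2} & & \ddots & \ddots & c_{n-1} \\
c_{n-1} & c_{n-2} & \cdots & c_1 & c_0
\end{bmatrix}. \tag{A.25}$$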

Note that C is completely determined by the n elements of its first column, c. 
To illustrate how structured matrices allow for faster matrix computations, consider 
solving the n x n linear system: 


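$$\mathbf{A}_n\mathbf{x}_n = \mathbf{a}_n$$

for $\mathbf{x}_n = [x_1,\ldots,x_n]^\top$, where $\mathbf{a}_n = [a_1,\ldots,a_n]^\top$, and

$$\mathbf{A}_n := \begin{bmatrix}
1 & a_1 & \cdots & a_{n-2} & a_{n-1} \\
a_1 & 1 & \ddots & & a_{n-2} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
a_{n-2} & & \ddots & \ddots & a_1 \\
a_{n-1} & a_{n-2} & \cdots & a_1 & 1
\end{bmatrix}$$

is a real-valued symmetric positive-definite Toeplitz matrix (so that it is invertible). Note
that the entries of $\mathbf{A}_n$ are completely determined by the right-hand side of the linear equation:
vector $\mathbf{a}_n$. As we shall see shortly in Example A.11, the solution to the more general
linear equation $\mathbf{A}_n\mathbf{x}_n = \mathbf{b}_n$, where $\mathbf{b}_n$ is arbitrary, can be efficiently computed using the
solution to this specific system $\mathbf{A}_n\mathbf{x}_n = \mathbf{a}_n$, obtained via a special recursive algorithm
(Algorithm A.6.3 below).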
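For every $k = 1,\ldots,n$ the $k \times k$ Toeplitz matrix $\mathbf{A}_k$ satisfies

$$\mathbf{A}_k = \mathbf{P}_k\mathbf{A}_k\mathbf{P}_k,$$

where $\mathbf{P}_k$ is a permutation matrix that "flips" the order of elements: rows when
pre-multiplying and columns when post-multiplying. For example,

$$\begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \end{bmatrix}\mathbf{P}_5
= \begin{bmatrix} 5 & 4 & 3 & 2 & 1 \\ 10 & 9 & 8 & 7 & 6 \end{bmatrix},
\quad\text{where}\quad
\mathbf{P}_5 = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{bmatrix}.$$

Clearly, $\mathbf{P}_k = \mathbf{P}_k^\top$ and $\mathbf{P}_k\mathbf{P}_k = \mathbf{I}_k$ hold, so that in fact $\mathbf{P}_k$ is an orthogonal matrix.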

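We can solve the $n \times n$ linear system $\mathbf{A}_n\mathbf{x}_n = \mathbf{a}_n$ in $O(n^2)$ time recursively, as follows.
Assume that we have somehow solved for the upper $k \times k$ block $\mathbf{A}_k\mathbf{x}_k = \mathbf{a}_k$ and now we
wish to solve for the $(k+1) \times (k+1)$ block:

$$\mathbf{A}_{k+1}\mathbf{x}_{k+1} = \mathbf{a}_{k+1} \iff
\begin{bmatrix} \mathbf{A}_k & \mathbf{P}_k\mathbf{a}_k \\ \mathbf{a}_k^\top\mathbf{P}_k & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{z} \\ \alpha \end{bmatrix}
= \begin{bmatrix} \mathbf{a}_k \\ a_{k+1} \end{bmatrix}.$$

Therefore,

$$\begin{aligned}
\alpha &= a_{k+1} - \mathbf{a}_k^\top\mathbf{P}_k\,\mathbf{z}, \\
\mathbf{A}_k\mathbf{z} &= \mathbf{a}_k - \alpha\,\mathbf{P}_k\mathbf{a}_k.
\end{aligned}$$

Since $\mathbf{A}_k^{-1}\mathbf{P}_k = \mathbf{P}_k\mathbf{A}_k^{-1}$, the second equation above simplifies to

$$\begin{aligned}
\mathbf{z} &= \mathbf{A}_k^{-1}\mathbf{a}_k - \alpha\,\mathbf{A}_k^{-1}\mathbf{P}_k\mathbf{a}_k \\
&= \mathbf{x}_k - \alpha\,\mathbf{P}_k\mathbf{x}_k.
\end{aligned}$$

Substituting $\mathbf{z} = \mathbf{x}_k - \alpha\,\mathbf{P}_k\mathbf{x}_k$ into $\alpha = a_{k+1} - \mathbf{a}_k^\top\mathbf{P}_k\,\mathbf{z}$ and solving for $\alpha$ yields:

$$\alpha = \frac{a_{k+1} - \mathbf{a}_k^\top\mathbf{P}_k\mathbf{x}_k}{1 - \mathbf{a}_k^\top\mathbf{x}_k}.$$

Finally, with the value of $\alpha$ computed above, we have

$$\mathbf{x}_{k+1} = \begin{bmatrix} \mathbf{x}_k - \alpha\,\mathbf{P}_k\mathbf{x}_k \\ \alpha \end{bmatrix}.$$

This gives the following Levinson–Durbin recursive algorithm for solving $\mathbf{A}_n\mathbf{x}_n = \mathbf{a}_n$.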


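Algorithm A.6.3: Levinson–Durbin Recursion for Solving $\mathbf{A}_n\mathbf{x}_n = \mathbf{a}_n$

input: First row $[1, a_1, \ldots, a_{n-1}] = [1, \mathbf{a}_{n-1}^\top]$ of matrix $\mathbf{A}_n$.
output: Solution $\mathbf{x}_n = \mathbf{A}_n^{-1}\mathbf{a}_n$.
1 $\mathbf{x}_1 \leftarrow a_1$
2 for $k = 1,\ldots,n-1$ do
3     $\beta_k \leftarrow 1 - \mathbf{a}_k^\top\mathbf{x}_k$
4     $\breve{\mathbf{x}} \leftarrow [x_{k,k}, x_{k,k-1}, \ldots, x_{k,1}]^\top$
5     $\alpha \leftarrow (a_{k+1} - \mathbf{a}_k^\top\breve{\mathbf{x}})/\beta_k$
6     $\mathbf{x}_{k+1} \leftarrow \begin{bmatrix} \mathbf{x}_k - \alpha\,\breve{\mathbf{x}} \\ \alpha \end{bmatrix}$
7 return $\mathbf{x}_n$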
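A direct transcription of the recursion into NumPy might look as follows. It is a sketch rather than a tuned implementation: the function name `levinson_durbin` and the example coefficients are hypothetical, and the code assumes that every $\beta_k$ is non-zero (which holds when $\mathbf{A}_n$ is positive definite).

```python
import numpy as np

def levinson_durbin(a):
    """Solve A_n x = a_n, where A_n is the symmetric Toeplitz matrix with
    first row [1, a_1, ..., a_{n-1}] and a = [a_1, ..., a_n]."""
    n = len(a)
    x = np.array([a[0]])                     # x_1 = a_1
    for k in range(1, n):
        beta = 1.0 - a[:k] @ x               # beta_k = 1 - a_k^T x_k
        x_flip = x[::-1]                     # P_k x_k
        alpha = (a[k] - a[:k] @ x_flip) / beta
        x = np.append(x - alpha * x_flip, alpha)
    return x

a = np.array([0.5, 0.2, 0.1, 0.05])          # hypothetical coefficients a_1, ..., a_4
n = len(a)
first_row = np.concatenate(([1.0], a[:-1]))
A = np.array([[first_row[abs(i - j)] for j in range(n)] for i in range(n)])
x = levinson_durbin(a)
```

Each pass of the loop costs $O(k)$ operations, which gives the overall $O(n^2)$ cost; the dense matrix `A` is built here only to verify the answer.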





In the algorithm above, we have identified $\mathbf{x}_k = [x_{k,1}, x_{k,2}, \ldots, x_{k,k}]^\top$. The advantage of
the Levinson–Durbin algorithm is that its running cost is $O(n^2)$, instead of the usual $O(n^3)$.

Using the $\{\mathbf{x}_k, \beta_k\}$ computed in Algorithm A.6.3, we construct the following lower triangular
matrix recursively, setting $\mathbf{L}_1 = 1$ and

$$\mathbf{L}_{k+1} = \begin{bmatrix} \mathbf{L}_k & \mathbf{0}_k \\ -(\mathbf{P}_k\mathbf{x}_k)^\top & 1 \end{bmatrix}, \quad k = 1,\ldots,n-1. \tag{A.27}$$
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Then, we have the following factorization of A,. 


Theorem A.14: Diagonalization of Toeplitz Correlation Matrix A, 





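Proof: We give a proof by induction. Obviously, $\mathbf{L}_1\mathbf{A}_1\mathbf{L}_1^\top = 1 \cdot 1 \cdot 1 = 1 = \mathbf{D}_1$ is true. Next,
assume that the factorization $\mathbf{L}_k\mathbf{A}_k\mathbf{L}_k^\top = \mathbf{D}_k$ holds for a given $k$. Observe that

$$\mathbf{L}_{k+1}\mathbf{A}_{k+1} =
\begin{bmatrix} \mathbf{L}_k & \mathbf{0}_k \\ -(\mathbf{P}_k\mathbf{x}_k)^\top & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{A}_k & \mathbf{P}_k\mathbf{a}_k \\ \mathbf{a}_k^\top\mathbf{P}_k & 1 \end{bmatrix}
= \begin{bmatrix} \mathbf{L}_k\mathbf{A}_k & \mathbf{L}_k\mathbf{P}_k\mathbf{a}_k \\
-(\mathbf{P}_k\mathbf{x}_k)^\top\mathbf{A}_k + \mathbf{a}_k^\top\mathbf{P}_k & -(\mathbf{P}_k\mathbf{x}_k)^\top\mathbf{P}_k\mathbf{a}_k + 1 \end{bmatrix}.$$

It is straightforward to verify that $[-(\mathbf{P}_k\mathbf{x}_k)^\top\mathbf{A}_k + \mathbf{a}_k^\top\mathbf{P}_k,\; -(\mathbf{P}_k\mathbf{x}_k)^\top\mathbf{P}_k\mathbf{a}_k + 1] = [\mathbf{0}_k^\top, \beta_k]$,
yielding the recursion

$$\mathbf{L}_{k+1}\mathbf{A}_{k+1} = \begin{bmatrix} \mathbf{L}_k\mathbf{A}_k & \mathbf{L}_k\mathbf{P}_k\mathbf{a}_k \\ \mathbf{0}_k^\top & \beta_k \end{bmatrix}.$$

Secondly, observe that

$$\mathbf{L}_{k+1}\mathbf{A}_{k+1}\mathbf{L}_{k+1}^\top =
\begin{bmatrix} \mathbf{L}_k\mathbf{A}_k & \mathbf{L}_k\mathbf{P}_k\mathbf{a}_k \\ \mathbf{0}_k^\top & \beta_k \end{bmatrix}
\begin{bmatrix} \mathbf{L}_k^\top & -\mathbf{P}_k\mathbf{x}_k \\ \mathbf{0}_k^\top & 1 \end{bmatrix}
= \begin{bmatrix} \mathbf{L}_k\mathbf{A}_k\mathbf{L}_k^\top & -\mathbf{L}_k\mathbf{A}_k\mathbf{P}_k\mathbf{x}_k + \mathbf{L}_k\mathbf{P}_k\mathbf{a}_k \\ \mathbf{0}_k^\top & \beta_k \end{bmatrix}.$$

By noting that $\mathbf{A}_k\mathbf{P}_k\mathbf{x}_k = \mathbf{P}_k\mathbf{P}_k\mathbf{A}_k\mathbf{P}_k\mathbf{x}_k = \mathbf{P}_k\mathbf{A}_k\mathbf{x}_k = \mathbf{P}_k\mathbf{a}_k$, we obtain:

$$\mathbf{L}_{k+1}\mathbf{A}_{k+1}\mathbf{L}_{k+1}^\top = \begin{bmatrix} \mathbf{L}_k\mathbf{A}_k\mathbf{L}_k^\top & \mathbf{0}_k \\ \mathbf{0}_k^\top & \beta_k \end{bmatrix}.$$

Hence, the result follows by induction. □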


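■ Example A.11 (Solving $\mathbf{A}_n\mathbf{x}_n = \mathbf{b}_n$ in $O(n^2)$ Time) One application of the factorization
in Theorem A.14 is in the fast solution of a linear system $\mathbf{A}_n\mathbf{x}_n = \mathbf{b}_n$, where the right-hand
side is an arbitrary vector $\mathbf{b}_n$. Since the solution $\mathbf{x}_n$ can be written as

$$\mathbf{x}_n = \mathbf{A}_n^{-1}\mathbf{b}_n = \mathbf{L}_n^\top\mathbf{D}_n^{-1}\mathbf{L}_n\mathbf{b}_n,$$

we can compute $\mathbf{x}_n$ in $O(n^2)$ time, as follows.

Algorithm A.6.4: Solving $\mathbf{A}_n\mathbf{x}_n = \mathbf{b}_n$ for a General Right-Hand Side
input: First row $[1, \mathbf{a}_{n-1}^\top]$ of matrix $\mathbf{A}_n$ and right-hand side $\mathbf{b}_n$.
output: Solution $\mathbf{x}_n = \mathbf{A}_n^{-1}\mathbf{b}_n$.
1 Compute $\mathbf{L}_n$ in (A.27) and the numbers $\beta_1,\ldots,\beta_{n-1}$ via Algorithm A.6.3.
2 $[x_1,\ldots,x_n]^\top \leftarrow \mathbf{L}_n\mathbf{b}_n$ (computed in $O(n^2)$ time)
3 $x_i \leftarrow x_i/\beta_{i-1}$ for $i = 2,\ldots,n$ (computed in $O(n)$ time)
4 $[x_1,\ldots,x_n] \leftarrow [x_1,\ldots,x_n]\,\mathbf{L}_n$ (computed in $O(n^2)$ time)
5 return $\mathbf{x}_n \leftarrow [x_1,\ldots,x_n]^\top$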




Note that it is possible to avoid the explicit construction of the lower triangular matrix 
in (A.27) via the following modification of Algorithm A.6.3, which only stores an extra 
vector y at each recursive step of the Levinson—Durbin algorithm. 


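Algorithm A.6.5: Solving $\mathbf{A}_n\mathbf{x}_n = \mathbf{b}_n$ with $O(n)$ Memory Cost
input: First row $[1, \mathbf{a}_{n-1}^\top]$ of matrix $\mathbf{A}_n$, and right-hand side $\mathbf{b}_n$.
output: Solution $\mathbf{x}_n = \mathbf{A}_n^{-1}\mathbf{b}_n$.
1 $\mathbf{x} \leftarrow b_1$
2 $\mathbf{y} \leftarrow a_1$
3 for $k = 1,\ldots,n-1$ do
4     $\breve{\mathbf{x}} \leftarrow [x_k, x_{k-1}, \ldots, x_1]$
5     $\breve{\mathbf{y}} \leftarrow [y_k, y_{k-1}, \ldots, y_1]$
6     $\beta \leftarrow 1 - \mathbf{a}_k^\top\mathbf{y}$
7     $\alpha_x \leftarrow (b_{k+1} - \mathbf{b}_k^\top\breve{\mathbf{y}})/\beta$
8     $\alpha_y \leftarrow (a_{k+1} - \mathbf{a}_k^\top\breve{\mathbf{y}})/\beta$
9     $\mathbf{x} \leftarrow [\mathbf{x} - \alpha_x\,\breve{\mathbf{y}},\; \alpha_x]^\top$
10    $\mathbf{y} \leftarrow [\mathbf{y} - \alpha_y\,\breve{\mathbf{y}},\; \alpha_y]^\top$
11 return $\mathbf{x}$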
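The following NumPy sketch follows the same idea: alongside the running solution `x` of $\mathbf{A}_k\mathbf{x} = \mathbf{b}_k$ it keeps the running solution `y` of $\mathbf{A}_k\mathbf{y} = \mathbf{a}_k$, and both are updated with the flipped vector $\mathbf{P}_k\mathbf{y}$. The update rule used in the code is derived from the block system in the same way as the Levinson–Durbin recursion; the function name `toeplitz_solve` and the example data are hypothetical, and every $\beta$ is assumed to be non-zero.

```python
import numpy as np

def toeplitz_solve(a, b):
    """Solve A_n x = b, where A_n is the symmetric Toeplitz matrix with
    first row [1, a_1, ..., a_{n-1}], a = [a_1, ..., a_{n-1}], len(b) = n."""
    n = len(b)
    x = np.array([b[0]])                      # solves A_1 x = b_1
    y = np.array([a[0]])                      # solves A_1 y = a_1
    for k in range(1, n):
        y_flip = y[::-1]                      # P_k y
        beta = 1.0 - a[:k] @ y
        alpha_x = (b[k] - b[:k] @ y_flip) / beta
        x = np.append(x - alpha_x * y_flip, alpha_x)
        if k < n - 1:                         # y is not needed after the final step
            alpha_y = (a[k] - a[:k] @ y_flip) / beta
            y = np.append(y - alpha_y * y_flip, alpha_y)
    return x

a = np.array([0.5, 0.2, 0.1])                 # hypothetical a_1, ..., a_{n-1}
b = np.array([1.0, 2.0, 3.0, 4.0])            # hypothetical right-hand side
n = len(b)
first_row = np.concatenate(([1.0], a))
A = np.array([[first_row[abs(i - j)] for j in range(n)] for i in range(n)])
x = toeplitz_solve(a, b)
```

Only the vectors `x` and `y` are stored between iterations, so the memory cost is $O(n)$; the dense matrix `A` appears only in the verification.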


A.7 Functional Analysis 


Much of the previous theory on Euclidean vector spaces can be generalized to vector spaces
of functions. Every element of a (real-valued) function space $\mathcal{H}$ is a function from some
set $\mathcal{X}$ to $\mathbb{R}$, and elements can be added and scalar multiplied as if they were vectors. In
other words, if $f \in \mathcal{H}$ and $g \in \mathcal{H}$, then $\alpha f + \beta g \in \mathcal{H}$ for all $\alpha, \beta \in \mathbb{R}$. On $\mathcal{H}$ we can impose
an inner product as a mapping $\langle \cdot, \cdot \rangle$ from $\mathcal{H} \times \mathcal{H}$ to $\mathbb{R}$ that satisfies

1. $\langle \alpha f_1 + \beta f_2, g\rangle = \alpha\langle f_1, g\rangle + \beta\langle f_2, g\rangle$;

2. $\langle f, g\rangle = \langle g, f\rangle$;

3. $\langle f, f\rangle \geq 0$;

4. $\langle f, f\rangle = 0$ if and only if $f = 0$ (the zero function).

We focus on real-valued function spaces, although the theory for complex-valued
function spaces is similar (and sometimes easier), under suitable modifications (e.g.,
$\langle f, g\rangle = \overline{\langle g, f\rangle}$).


Similar to the linear algebra setting in Section A.2, we say that two elements $f$ and $g$
in $\mathcal{H}$ are orthogonal to each other with respect to this inner product if $\langle f, g\rangle = 0$. Given an
inner product, we can measure distances between elements of the function space $\mathcal{H}$ using
the norm

$$\|f\| := \sqrt{\langle f, f\rangle}.$$

For example, the distance between two functions $f_m$ and $f_n$ is given by $\|f_m - f_n\|$. The space
$\mathcal{H}$ is said to be complete if every sequence of functions $f_1, f_2, \ldots \in \mathcal{H}$ for which

$$\|f_m - f_n\| \to 0 \quad\text{as } m, n \to \infty, \tag{A.28}$$

converges to some $f \in \mathcal{H}$; that is, $\|f - f_n\| \to 0$ as $n \to \infty$. A sequence that satisfies (A.28)
is called a Cauchy sequence.

A complete inner product space is called a Hilbert space. The most fundamental Hilbert
space of functions is the space $L^2$. An in-depth introduction to $L^2$ requires some measure
theory [6]. For our purposes, it suffices to assume that $\mathcal{X} \subseteq \mathbb{R}^d$ and that on $\mathcal{X}$ a measure $\mu$ is
defined which assigns to each suitable$^4$ set $A$ a positive number $\mu(A) \geq 0$ (e.g., its volume).
In many cases of interest $\mu$ is of the form

$$\mu(A) = \int_A w(\mathbf{x})\, \mathrm{d}\mathbf{x}, \tag{A.29}$$

where $w \geq 0$ is a positive function on $\mathcal{X}$ which is called the density of $\mu$ with respect to
the Lebesgue measure (the natural volume measure on $\mathbb{R}^d$). We write $\mu(\mathrm{d}\mathbf{x}) = w(\mathbf{x})\, \mathrm{d}\mathbf{x}$ to
indicate that $\mu$ has density $w$. Another important case is where

$$\mu(A) = \sum_{\mathbf{x} \in A \cap \mathbb{Z}^d} w(\mathbf{x}), \tag{A.30}$$

where $w \geq 0$ is again called the density of $\mu$, but now with respect to the counting measure
on $\mathbb{R}^d$ (which counts the points of $\mathbb{Z}^d$). Integrals with respect to measures $\mu$ in (A.29) and
(A.30) can now be defined as

$$\int f(\mathbf{x})\, \mu(\mathrm{d}\mathbf{x}) = \int f(\mathbf{x})\, w(\mathbf{x})\, \mathrm{d}\mathbf{x},$$

and

$$\int f(\mathbf{x})\, \mu(\mathrm{d}\mathbf{x}) = \sum_{\mathbf{x}} f(\mathbf{x})\, w(\mathbf{x}),$$

respectively. We assume for simplicity that $\mu$ has the form (A.29). For measures of the
form (A.30) (so-called discrete measures), replace integrals by sums in what follows.


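Definition A.4: $L^2$ Space

Let $\mathcal{X}$ be a subset of $\mathbb{R}^d$ with measure $\mu(\mathrm{d}\mathbf{x}) = w(\mathbf{x})\, \mathrm{d}\mathbf{x}$. The Hilbert space $L^2(\mathcal{X}, \mu)$
is the linear space of functions from $\mathcal{X}$ to $\mathbb{R}$ that satisfy

$$\int_{\mathcal{X}} f(\mathbf{x})^2\, w(\mathbf{x})\, \mathrm{d}\mathbf{x} < \infty, \tag{A.31}$$

and with inner product

$$\langle f, g\rangle = \int_{\mathcal{X}} f(\mathbf{x})\, g(\mathbf{x})\, w(\mathbf{x})\, \mathrm{d}\mathbf{x}.$$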


Let $\mathcal{H}$ be a Hilbert space. A set of functions $\{f_i, i \in \mathcal{I}\}$ is called an orthonormal system
if

1. the norm of every $f_i$ is 1; that is, $\langle f_i, f_i\rangle = 1$ for all $i \in \mathcal{I}$,
2. the $\{f_i\}$ are orthogonal; that is, $\langle f_i, f_j\rangle = 0$ for $i \neq j$.

It follows then that the $\{f_i\}$ are linearly independent; that is, the only linear combination
$\sum_j \alpha_j f_j(x)$ that is equal to $f_i(x)$ for all $x$ is the one where $\alpha_i = 1$ and $\alpha_j = 0$ for $j \neq i$. An
orthonormal system $\{f_i\}$ is called an orthonormal basis if there is no other $f \in \mathcal{H}$ that is
orthogonal to all the $\{f_i, i \in \mathcal{I}\}$ (other than the zero function). Although the general theory
allows for uncountable bases, in practice$^5$ the set $\mathcal{I}$ is taken to be countable.

$^4$Not all sets have a measure. Suitable sets are Borel sets, which can be thought of as countable unions
of rectangles.


■ Example A.12 (Trigonometric Orthonormal Basis) Let $\mathcal{H}$ be the Hilbert space
$L^2((0, 2\pi), \mu)$, where $\mu(\mathrm{d}x) = w(x)\, \mathrm{d}x$ and $w$ is the constant function $w(x) = 1$, $0 < x < 2\pi$.
Alternatively, take $\mathcal{X} = \mathbb{R}$ and $w$ the indicator function on $(0, 2\pi)$. The trigonometric functions

$$g_0(x) = \frac{1}{\sqrt{2\pi}}, \quad g_k(x) = \frac{1}{\sqrt{\pi}}\cos(kx), \quad h_k(x) = \frac{1}{\sqrt{\pi}}\sin(kx), \quad k = 1, 2, \ldots$$

form a countable infinite-dimensional orthonormal basis of $\mathcal{H}$. ■


A Hilbert space $\mathcal{H}$ with an orthonormal basis $\{f_1, f_2, \ldots\}$ behaves very similarly to the
familiar Euclidean vector space. In particular, every element (i.e., function) $f \in \mathcal{H}$ can be
written as a unique linear combination of the basis vectors:

$$f = \sum_i \langle f, f_i\rangle\, f_i, \tag{A.33}$$


exactly as in Theorem A.3. The right-hand side of (A.33) is called a (generalized) Fourier 
expansion of f. Note that such a Fourier expansion does not require a trigonometric basis; 
any orthonormal basis will do. 


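■ Example A.13 (Example A.12 (cont.)) Consider the indicator function $f(x) = \mathbb{1}\{0 <
x < \pi\}$. As the trigonometric functions $\{g_k\}$ and $\{h_k\}$ form a basis for $L^2((0, 2\pi), \mu)$, we
can write

$$f(x) = a_0\frac{1}{\sqrt{2\pi}} + \sum_{k=1}^{\infty} a_k\frac{1}{\sqrt{\pi}}\cos(kx) + \sum_{k=1}^{\infty} b_k\frac{1}{\sqrt{\pi}}\sin(kx), \tag{A.34}$$

where $a_0 = \int_0^{\pi} 1/\sqrt{2\pi}\, \mathrm{d}x = \sqrt{\pi/2}$, $a_k = \int_0^{\pi} \cos(kx)/\sqrt{\pi}\, \mathrm{d}x$ and $b_k = \int_0^{\pi} \sin(kx)/\sqrt{\pi}\, \mathrm{d}x$, $k =
1, 2, \ldots$. This means that $a_k = 0$ for all $k \geq 1$, $b_k = 0$ for even $k$, and $b_k = 2/(k\sqrt{\pi})$ for odd $k$.
Consequently,

$$f(x) = \frac{1}{2} + \frac{2}{\pi}\sum_{i=0}^{\infty} \frac{\sin((2i+1)x)}{2i+1}. \tag{A.35}$$

Figure A.7 shows several Fourier approximations obtained by truncating the infinite sum
in (A.35). ■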
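The truncated expansion is easy to evaluate numerically. The sketch below, with the hypothetical helper `partial_sum` and an arbitrary truncation level, evaluates the series $\tfrac{1}{2} + \tfrac{2}{\pi}\sum_i \sin((2i+1)x)/(2i+1)$ at one point inside $(0, \pi)$, where the indicator equals 1, and one point inside $(\pi, 2\pi)$, where it equals 0.

```python
import numpy as np

def partial_sum(x, m):
    """1/2 + (2/pi) * sum_{i=0}^{m-1} sin((2i+1) x) / (2i+1), evaluated at each x."""
    odd = 2 * np.arange(m) + 1
    return 0.5 + (2 / np.pi) * np.sum(np.sin(np.outer(x, odd)) / odd, axis=1)

x = np.array([np.pi / 2, 3 * np.pi / 2])   # one point in (0, pi), one in (pi, 2 pi)
approx = partial_sum(x, 2000)              # hypothetical truncation level
```

Convergence is slow (the coefficients decay like $1/k$) and is not uniform near the jumps at $0$ and $\pi$, which is why truncated sums show oscillations close to the discontinuities.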





$^5$The function spaces typically encountered in machine learning and data science are usually separable
spaces, which allows for the set $\mathcal{I}$ to be considered countable; see, e.g., [106].

Figure A.7: Fourier approximations of the unit step function $f$ on the interval $(0, \pi)$, truncating
the infinite sum in (A.35) to $i = 2$, 4, and 14 terms, giving the dotted blue, dashed
red, and solid green curves, respectively.

Starting from any countable basis, we can use the Gram-Schmidt procedure to obtain
an orthonormal basis, as illustrated in the following example.
■ Example A.14 (Legendre Polynomials) Take the function space $L^2(\mathbb{R}, w(x)\, \mathrm{d}x)$, where
$w(x) = \mathbb{1}\{-1 < x < 1\}$. We wish to construct an orthonormal basis of polynomial functions
$g_0, g_1, g_2, \ldots$, starting from the collection of monomials: $\iota_0, \iota_1, \iota_2, \ldots$, where $\iota_k : x \mapsto x^k$. Using
Gram-Schmidt, the first normalized zero-degree polynomial is $g_0 = \iota_0/\|\iota_0\| = \sqrt{1/2}$.
To find $g_1$ (a polynomial of degree 1), project $\iota_1$ (the identity function) onto the space
spanned by $g_0$. The resulting projection is $p_1 := \langle g_0, \iota_1\rangle g_0$, written out as

$$p_1(x) = \left(\int_{-1}^{1} x\, g_0(x)\, \mathrm{d}x\right) g_0(x) = \frac{1}{2}\int_{-1}^{1} x\, \mathrm{d}x = 0.$$

Hence, $g_1 = (\iota_1 - p_1)/\|\iota_1 - p_1\|$ is a linear function; that is, of the form $g_1(x) = ax$. The
constant $a$ is found by normalization:

$$1 = \|g_1\|^2 = \int_{-1}^{1} g_1^2(x)\, \mathrm{d}x = a^2\int_{-1}^{1} x^2\, \mathrm{d}x = a^2\,\frac{2}{3},$$

so that $g_1(x) = \sqrt{3/2}\, x$. Continuing the Gram-Schmidt procedure, we find $g_2(x) =
\sqrt{5/8}\,(3x^2 - 1)$, $g_3(x) = \sqrt{7/8}\,(5x^3 - 3x)$ and, in general,

$$g_k(x) = \frac{\sqrt{2k+1}}{2^{k+\frac{1}{2}}\, k!}\, \frac{\mathrm{d}^k}{\mathrm{d}x^k}(x^2 - 1)^k, \quad k = 0, 1, 2, \ldots.$$

These are the (normalized) Legendre polynomials. The graphs of $g_0$, $g_1$, $g_2$, and $g_3$ are given
in Figure A.8. ■

Figure A.8: The first 4 normalized Legendre polynomials.


As the Legendre polynomials form an orthonormal basis of $L^2(\mathbb{R}, \mathbb{1}\{-1 < x < 1\}\, \mathrm{d}x)$,
they can be used to approximate arbitrary functions in this space. For example, Figure A.9 
shows an approximation using the first 51 Legendre polynomials (k = 0,1,...,50) of the 
Fourier expansion of the indicator function on the interval (—1/2, 1/2). These Legendre 
polynomials form the basis of a 51-dimensional linear subspace onto which the indicator 
function is orthogonally projected. 
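The Gram-Schmidt construction of the normalized Legendre polynomials can be reproduced with NumPy's polynomial class, using exact integration of polynomials over $(-1, 1)$ for the inner product. The helper names `inner` and `legendre_basis` are hypothetical, and the sketch is only meant for small degrees, since Gram-Schmidt on monomials becomes numerically fragile as the degree grows.

```python
import numpy as np
from numpy.polynomial import Polynomial

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over (-1, 1), computed from the antiderivative."""
    antideriv = (p * q).integ()
    return float(antideriv(1.0) - antideriv(-1.0))

def legendre_basis(m):
    """Gram-Schmidt applied to the monomials 1, x, ..., x^(m-1)."""
    basis = []
    for k in range(m):
        mono = Polynomial([0.0] * k + [1.0])     # the monomial x^k
        v = mono
        for g_j in basis:
            v = v - inner(mono, g_j) * g_j       # subtract the projection onto g_j
        basis.append(v / np.sqrt(inner(v, v)))   # normalize
    return basis

g = legendre_basis(4)
```

The coefficient arrays `g[2].coef` and `g[3].coef` (lowest degree first) agree with $\sqrt{5/8}\,(3x^2 - 1)$ and $\sqrt{7/8}\,(5x^3 - 3x)$.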











-1 -0.5 0 0.5 1 


Figure A.9: Approximation of the indicator function on the interval $(-1/2, 1/2)$, using the
Legendre polynomials $g_0, g_1, \ldots, g_{50}$.


The Legendre polynomials were produced in the following way: We started with an
unnormalized probability density on $\mathbb{R}$, in this case the probability density of the uniform
distribution on $(-1, 1)$. We then constructed a sequence of polynomials by applying the
Gram-Schmidt procedure to the monomials $1, x, x^2, \ldots$.

By using exactly the same procedure, but with a different probability density, we can
produce other such orthogonal polynomials. For example, the density of the standard
exponential$^6$ distribution, $w(x) = \mathrm{e}^{-x}$, $x \geq 0$, gives the Laguerre polynomials, which are defined
by the recurrence

$$(n+1)\, g_{n+1}(x) = (2n + 1 - x)\, g_n(x) - n\, g_{n-1}(x), \quad n = 1, 2, \ldots,$$

with $g_0(x) = 1$ and $g_1(x) = 1 - x$, for $x \geq 0$. The Hermite polynomials are obtained when
using instead the density of the standard normal distribution: $w(x) = \mathrm{e}^{-x^2/2}/\sqrt{2\pi}$, $x \in \mathbb{R}$.
These polynomials satisfy the recursion

$$g_{n+1}(x) = x\, g_n(x) - \frac{\mathrm{d}g_n(x)}{\mathrm{d}x}, \quad n = 0, 1, \ldots,$$

with $g_0(x) = 1$, $x \in \mathbb{R}$. Note that the Hermite polynomials as defined above have not been
normalized to have norm 1. To normalize, use the fact that $\|g_n\|^2 = n!$.

$^6$This can be further generalized to the density of a gamma distribution.
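The Hermite recursion and the stated squared norms can be checked with a short sketch. It generates $g_0,\ldots,g_5$ from the recursion $g_{n+1}(x) = x\,g_n(x) - g_n'(x)$ and evaluates $\|g_n\|^2$ under the standard normal density by Gauss-Hermite quadrature; the number of polynomials and quadrature nodes are arbitrary choices for the example.

```python
import math
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.hermite_e import hermegauss

x_poly = Polynomial([0.0, 1.0])                  # the polynomial x
g = [Polynomial([1.0])]                          # g_0 = 1
for n in range(5):
    g.append(x_poly * g[n] - g[n].deriv())       # g_{n+1} = x g_n - g_n'

nodes, weights = hermegauss(20)                  # quadrature for the weight exp(-x^2 / 2)
weights = weights / np.sqrt(2 * np.pi)           # rescale to the standard normal density
sq_norms = [float(np.sum(weights * g_n(nodes) ** 2)) for g_n in g]
```

A 20-node rule integrates polynomials up to degree 39 exactly, so the computed values match $0!, 1!, \ldots, 5!$ up to rounding.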

We conclude with a number of key results in functional analysis. The first one is the
celebrated Cauchy–Schwarz inequality.

Theorem A.15: Cauchy–Schwarz

Proof: The inequality is trivially true for $g = 0$ (zero function). For $g \neq 0$, we can write
$f = \alpha g + h$, where $h \perp g$ and $\alpha = \langle f, g\rangle/\|g\|^2$. Consequently, $\|f\|^2 = |\alpha|^2\|g\|^2 + \|h\|^2 \geq
|\alpha|^2\|g\|^2$. The result follows after rearranging this last inequality. □

Let $\mathcal{V}$ and $\mathcal{W}$ be two linear vector spaces (for example, Hilbert spaces) on which norms
$\|\cdot\|_{\mathcal{V}}$ and $\|\cdot\|_{\mathcal{W}}$ are defined. Suppose $A : \mathcal{V} \to \mathcal{W}$ is a mapping from $\mathcal{V}$ to $\mathcal{W}$. When
$\mathcal{W} = \mathcal{V}$, such a mapping is often called an operator; when $\mathcal{W} = \mathbb{R}$ it is called a functional.
Mapping $A$ is said to be linear if $A(\alpha f + \beta g) = \alpha A(f) + \beta A(g)$. In this case we write $Af$
instead of $A(f)$. If there exists $\gamma < \infty$ such that

$$\|Af\|_{\mathcal{W}} \leq \gamma\, \|f\|_{\mathcal{V}}, \quad f \in \mathcal{V}, \tag{A.36}$$

then $A$ is said to be a bounded mapping. The smallest $\gamma$ for which (A.36) holds is called the
norm of $A$; denoted by $\|A\|$. A (not necessarily linear) mapping $A : \mathcal{V} \to \mathcal{W}$ is said to be
continuous at $f$ if for any sequence $f_1, f_2, \ldots$ converging to $f$ the sequence $A(f_1), A(f_2), \ldots$
converges to $A(f)$. That is, if

$$\forall \varepsilon > 0,\ \exists \delta > 0 : \forall g \in \mathcal{V},\ \|f - g\|_{\mathcal{V}} < \delta \Rightarrow \|A(f) - A(g)\|_{\mathcal{W}} < \varepsilon. \tag{A.37}$$

If the above property holds for every $f \in \mathcal{V}$, then the mapping $A$ itself is called continuous.


Theorem A.16: Continuity and Boundedness for Linear Mappings 





Proof: Let $A$ be linear and bounded. We may assume that $A$ is non-zero (otherwise the
statement holds trivially), and that therefore $0 < \|A\| < \infty$. Taking $\delta < \varepsilon/\|A\|$ in (A.37) now
ensures that $\|Af - Ag\|_{\mathcal{W}} \leq \|A\|\, \|f - g\|_{\mathcal{V}} < \|A\|\, \delta < \varepsilon$. This shows that $A$ is continuous.
Conversely, suppose $A$ is continuous. In particular, it is continuous at $f = 0$ (the zero-element
of $\mathcal{V}$). Thus, take $f = 0$ and let $\varepsilon$ and $\delta$ be as in (A.37). For any $g \neq 0$, let $h =
\delta/(2\|g\|_{\mathcal{V}})\, g$. As $\|h\|_{\mathcal{V}} = \delta/2 < \delta$, it follows from (A.37) that

$$\|Ah\|_{\mathcal{W}} = \frac{\delta}{2\|g\|_{\mathcal{V}}}\, \|Ag\|_{\mathcal{W}} < \varepsilon.$$

Rearranging the last inequality gives $\|Ag\|_{\mathcal{W}} < (2\varepsilon/\delta)\, \|g\|_{\mathcal{V}}$, showing that $A$ is bounded. □


Theorem A.17: Riesz Representation Theorem 





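Proof: Let $P$ be the projection of $\mathcal{H}$ onto the nullspace $\mathcal{N}$ of $\phi$; that is, $\mathcal{N} = \{g \in \mathcal{H} :
\phi(g) = 0\}$. If $\phi$ is not the 0-functional, then there exists a $g_0 \neq 0$ with $\phi(g_0) \neq 0$. Let
$g_1 = g_0 - Pg_0$. Then $g_1 \perp \mathcal{N}$ and $\phi(g_1) = \phi(g_0)$. Take $g_2 = g_1/\phi(g_1)$. For any $h \in \mathcal{H}$,
$f := h - \phi(h)\, g_2$ lies in $\mathcal{N}$. As $g_2 \perp \mathcal{N}$ it holds that $\langle f, g_2\rangle = 0$, which is equivalent to
$\langle h, g_2\rangle = \phi(h)\, \|g_2\|^2$. By defining $g = g_2/\|g_2\|^2$ we have found our representation. □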


A.8 Fourier Transforms 


We will now briefly introduce the Fourier transform. Before doing so, we will extend the
concept of $L^2$ space of real-valued functions as follows.


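Definition A.5: $L^p$ Space

Let $\mathcal{X}$ be a subset of $\mathbb{R}^d$ with measure $\mu(\mathrm{d}\mathbf{x}) = w(\mathbf{x})\, \mathrm{d}\mathbf{x}$ and $p \in [1, \infty)$. Then $L^p(\mathcal{X}, \mu)$
is the linear space of functions from $\mathcal{X}$ to $\mathbb{C}$ that satisfy

$$\int_{\mathcal{X}} |f(\mathbf{x})|^p\, w(\mathbf{x})\, \mathrm{d}\mathbf{x} < \infty. \tag{A.38}$$

When $p = 2$, $L^2(\mathcal{X}, \mu)$ is in fact a Hilbert space equipped with inner product

$$\langle f, g\rangle = \int_{\mathcal{X}} f(\mathbf{x})\, \overline{g(\mathbf{x})}\, w(\mathbf{x})\, \mathrm{d}\mathbf{x}. \tag{A.39}$$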





We are now in a position to define the Fourier transform (with respect to the Lebesgue
measure). Note that in the following Definitions A.6 and A.7 we have chosen a particular
convention. Equivalent (but not identical) definitions exist that include scaling constants
(powers of $2\pi$) and where $-2\pi\mathbf{t}$ is replaced with $2\pi\mathbf{t}$, $\mathbf{t}$, or $-\mathbf{t}$.
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Definition A.6: (Multivariate) Fourier Transform 





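The Fourier transform $\mathcal{F}[f]$ of a (real- or complex-valued) function $f \in L^1(\mathbb{R}^d)$ is
the function $\tilde{f}$ defined as

$$\tilde{f}(\mathbf{t}) := \int_{\mathbb{R}^d} \mathrm{e}^{-\mathrm{i}\, 2\pi\, \mathbf{t}^\top\mathbf{x}}\, f(\mathbf{x})\, \mathrm{d}\mathbf{x}, \quad \mathbf{t} \in \mathbb{R}^d.$$

The Fourier transform $\tilde{f}$ is continuous, uniformly bounded (since $f \in L^1(\mathbb{R}^d)$ implies
that $|\tilde{f}(\mathbf{t})| \leq \int_{\mathbb{R}^d} |f(\mathbf{x})|\, \mathrm{d}\mathbf{x} < \infty$), and satisfies $\lim_{\|\mathbf{t}\| \to \infty} \tilde{f}(\mathbf{t}) = 0$ (a result known as
the Riemann–Lebesgue lemma). However, $|\tilde{f}|$ does not necessarily have a finite integral.
A simple example in $\mathbb{R}^1$ is the Fourier transform of $f(x) = \mathbb{1}\{-1/2 < x < 1/2\}$. Then
$\tilde{f}(t) = \sin(\pi t)/(\pi t) = \mathrm{sinc}(\pi t)$, which is not absolutely integrable.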
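Definition A.7: (Multivariate) Inverse Fourier Transform

The inverse Fourier transform $\mathcal{F}^{-1}[\tilde{f}]$ of a (real- or complex-valued) function
$\tilde{f} \in L^1(\mathbb{R}^d)$ is the function $\breve{f}$ defined as

$$\breve{f}(\mathbf{x}) := \int_{\mathbb{R}^d} \mathrm{e}^{\mathrm{i}\, 2\pi\, \mathbf{x}^\top\mathbf{t}}\, \tilde{f}(\mathbf{t})\, \mathrm{d}\mathbf{t}, \quad \mathbf{x} \in \mathbb{R}^d.$$

As one would hope, it holds that if $f$ and $\mathcal{F}[f]$ are both in $L^1(\mathbb{R}^d)$, then $f = \mathcal{F}^{-1}[\mathcal{F}[f]]$
almost everywhere.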

The Fourier transform enjoys many interesting and useful properties, some of which 
we list below. 


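1. Linearity: For $f, g \in L^1(\mathbb{R}^d)$ and constants $a, b \in \mathbb{R}$,

$$\mathcal{F}[af + bg] = a\,\mathcal{F}[f] + b\,\mathcal{F}[g].$$

2. Space Shifting and Scaling: Let $\mathbf{A} \in \mathbb{R}^{d \times d}$ be an invertible matrix and $\mathbf{b} \in \mathbb{R}^d$ a constant
vector. Let $f \in L^1(\mathbb{R}^d)$ and define $h(\mathbf{x}) := f(\mathbf{A}\mathbf{x} + \mathbf{b})$. Then

$$\mathcal{F}[h](\mathbf{t}) = \mathrm{e}^{\mathrm{i}\, 2\pi\, (\mathbf{A}^{-\top}\mathbf{t})^\top\mathbf{b}}\, \tilde{f}(\mathbf{A}^{-\top}\mathbf{t})/|\det(\mathbf{A})|,$$

where $\mathbf{A}^{-\top} := (\mathbf{A}^\top)^{-1} = (\mathbf{A}^{-1})^\top$.

3. Frequency Shifting and Scaling: Let $\mathbf{A} \in \mathbb{R}^{d \times d}$ be an invertible matrix and $\mathbf{b} \in \mathbb{R}^d$ a
constant vector. Let $f \in L^1(\mathbb{R}^d)$ and define

$$h(\mathbf{x}) := \mathrm{e}^{-\mathrm{i}\, 2\pi\, \mathbf{b}^\top\mathbf{A}^{-\top}\mathbf{x}}\, f(\mathbf{A}^{-\top}\mathbf{x})/|\det(\mathbf{A})|.$$

Then $\mathcal{F}[h](\mathbf{t}) = \tilde{f}(\mathbf{A}\mathbf{t} + \mathbf{b})$.

4. Differentiation: Let $f \in L^1(\mathbb{R}^d) \cap C^1(\mathbb{R}^d)$ and let $f_k := \partial f/\partial x_k$ be the partial derivative
of $f$ with respect to $x_k$. If $f_k \in L^1(\mathbb{R}^d)$ for $k = 1,\ldots,d$, then

$$\mathcal{F}[f_k](\mathbf{t}) = (\mathrm{i}\, 2\pi\, t_k)\, \tilde{f}(\mathbf{t}).$$

5. Convolution: Let $f, g \in L^1(\mathbb{R}^d)$ be real or complex valued functions. Their convolution,
$f * g$, is defined as

$$(f * g)(\mathbf{x}) = \int_{\mathbb{R}^d} f(\mathbf{y})\, g(\mathbf{x} - \mathbf{y})\, \mathrm{d}\mathbf{y},$$

and is also in $L^1(\mathbb{R}^d)$. Moreover, the Fourier transform satisfies

$$\mathcal{F}[f * g] = \mathcal{F}[f]\, \mathcal{F}[g].$$

6. Duality: Let $f$ and $\mathcal{F}[f]$ both be in $L^1(\mathbb{R}^d)$. Then $\mathcal{F}[\mathcal{F}[f]](\mathbf{t}) = f(-\mathbf{t})$.

7. Product Formula: Let $f, g \in L^1(\mathbb{R}^d)$ and denote by $\tilde{f}, \tilde{g}$ their respective Fourier transforms.
Then $\tilde{f} g, f \tilde{g} \in L^1(\mathbb{R}^d)$, and

$$\int_{\mathbb{R}^d} \tilde{f}(\mathbf{z})\, g(\mathbf{z})\, \mathrm{d}\mathbf{z} = \int_{\mathbb{R}^d} f(\mathbf{z})\, \tilde{g}(\mathbf{z})\, \mathrm{d}\mathbf{z}.$$

There are many additional properties which hold if $f \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$. In particular,
if $f, g \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$, then $\tilde{f}, \tilde{g} \in L^2(\mathbb{R}^d)$ and $\langle \tilde{f}, \tilde{g}\rangle = \langle f, g\rangle$, a result often known as
Parseval's formula. Putting $g = f$ gives the result often referred to as Plancherel's theorem.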

The Fourier transform can be extended in several ways, in the first instance to functions
in $L^2(\mathbb{R}^d)$ by continuity. A substantial extension of the theory is realized by replacing integration
with respect to the Lebesgue measure (i.e., $\int_{\mathbb{R}^d} \cdots\, \mathrm{d}\mathbf{x}$) with integration with respect
to a (finite Borel) measure $\mu$ (i.e., $\int_{\mathbb{R}^d} \cdots\, \mu(\mathrm{d}\mathbf{x})$). Moreover, there is a close connection
between the Fourier transform and characteristic functions arising in probability theory.
Indeed, if $\mathbf{X}$ is a random vector with pdf $f$, then its characteristic function $\psi$ satisfies

$$\psi(\mathbf{t}) := \mathbb{E}\, \mathrm{e}^{\mathrm{i}\, \mathbf{t}^\top\mathbf{X}} = \mathcal{F}[f](-\mathbf{t}/(2\pi)).$$


A.8.1 Discrete Fourier Transform 


Here, we introduce the (univariate) discrete Fourier transform, which can be viewed as a 
special case of the Fourier transform introduced in Definition A.6, where d = 1, integration 
is with respect to the counting measure, and f(x) = 0 for x < 0 and x > (n — 1). 


Definition A.8: Discrete Fourier Transform 


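The discrete Fourier transform (DFT) of a vector $\mathbf{x} = [x_0,\ldots,x_{n-1}]^\top \in \mathbb{C}^n$ is the
vector $\tilde{\mathbf{x}} = [\tilde{x}_0,\ldots,\tilde{x}_{n-1}]^\top$ whose elements are given by

$$\tilde{x}_t = \sum_{s=0}^{n-1} \omega^{st}\, x_s, \quad t = 0,\ldots,n-1, \tag{A.40}$$

where $\omega = \exp(-\mathrm{i}\, 2\pi/n)$.

In other words, $\tilde{\mathbf{x}}$ is obtained from $\mathbf{x}$ via the linear transformation

$$\tilde{\mathbf{x}} = \mathbf{F}\mathbf{x},$$

where

$$\mathbf{F} = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{n-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(n-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{n-1} & \omega^{2(n-1)} & \cdots & \omega^{(n-1)^2}
\end{bmatrix}.$$

The matrix $\mathbf{F}$ is a so-called Vandermonde matrix, and is clearly symmetric (i.e., $\mathbf{F} = \mathbf{F}^\top$).
Moreover, $\mathbf{F}/\sqrt{n}$ is in fact a unitary matrix and hence its inverse is simply its complex
conjugate $\overline{\mathbf{F}}/\sqrt{n}$. Thus, $\mathbf{F}^{-1} = \overline{\mathbf{F}}/n$ and we have that the inverse discrete Fourier transform
(IDFT) is given by

$$x_t = \frac{1}{n}\sum_{s=0}^{n-1} \omega^{-st}\, \tilde{x}_s, \quad t = 0,\ldots,n-1, \tag{A.41}$$

or in terms of matrices and vectors,

$$\mathbf{x} = \overline{\mathbf{F}}\,\tilde{\mathbf{x}}/n.$$

Observe that the IDFT of a vector $\mathbf{y}$ is related to the DFT of its complex conjugate $\overline{\mathbf{y}}$, since

$$\overline{\mathbf{F}}\,\mathbf{y}/n = \overline{\mathbf{F}\,\overline{\mathbf{y}}}/n.$$

Consequently, an IDFT can be computed via a DFT.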
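These identities can be checked against NumPy's FFT routines, whose forward transform `np.fft.fft` uses the same sign convention $\omega = \exp(-\mathrm{i}\,2\pi/n)$ and whose inverse `np.fft.ifft` includes the factor $1/n$. The vector length and the random test vector below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Explicit DFT matrix F with entries omega^(s t), omega = exp(-i 2 pi / n)
s = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(s, s) / n)

idft_direct = np.fft.ifft(y)                          # conj(F) y / n
idft_via_dft = np.conj(np.fft.fft(np.conj(y))) / n    # conj(F conj(y)) / n
```

Building `F` explicitly costs $O(n^2)$ memory and time and is done here only to confirm the matrix form of the definitions.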

There is a close connection between circulant matrices $\mathbf{C}$ and the DFT. To make this
connection concrete, let $\mathbf{C}$ be the circulant matrix corresponding to the vector $\mathbf{c} \in \mathbb{C}^n$ and
denote by $\mathbf{f}_t$ the $t$-th column of the discrete Fourier matrix $\mathbf{F}$, $t = 0, 1,\ldots,n-1$. Then, the
$s$-th element of $\mathbf{C}\mathbf{f}_t$ is

$$\sum_{k=0}^{n-1} c_{(s-k) \bmod n}\, \omega^{tk} = \sum_{y=0}^{n-1} c_y\, \omega^{t(s-y)} = \underbrace{\omega^{ts}}_{s\text{-th element of } \mathbf{f}_t}\; \underbrace{\sum_{y=0}^{n-1} c_y\, \omega^{-ty}}_{\lambda_t}.$$

Hence, the eigenvalues of $\mathbf{C}$ are

$$\lambda_t = \mathbf{c}^\top\overline{\mathbf{f}}_t, \quad t = 0, 1,\ldots,n-1,$$

with corresponding eigenvectors $\mathbf{f}_t$. Collecting the eigenvalues into the vector $\boldsymbol{\lambda} =
[\lambda_0,\ldots,\lambda_{n-1}]^\top = \overline{\mathbf{F}}\mathbf{c}$, we therefore have the eigen-decomposition

$$\mathbf{C} = \mathbf{F}\, \mathrm{diag}(\boldsymbol{\lambda})\, \overline{\mathbf{F}}/n.$$

Consequently, one can compute the circular convolution of a vector $\mathbf{a} = [a_1,\ldots,a_n]^\top$
and $\mathbf{c} = [c_0,\ldots,c_{n-1}]^\top$ by a series of DFTs as follows. Construct the circulant matrix $\mathbf{C}$
corresponding to $\mathbf{c}$. Then, the circular convolution of $\mathbf{a}$ and $\mathbf{c}$ is given by $\mathbf{y} = \mathbf{C}\mathbf{a}$. Proceed
in four steps:

1. Compute $\mathbf{z} = \overline{\mathbf{F}}\mathbf{a}/n$.

2. Compute $\boldsymbol{\lambda} = \overline{\mathbf{F}}\mathbf{c}$.

3. Compute $\mathbf{p} = \mathbf{z} \odot \boldsymbol{\lambda} = [z_1\lambda_0,\ldots,z_n\lambda_{n-1}]^\top$.

4. Compute $\mathbf{y} = \mathbf{F}\mathbf{p}$.

Steps 1 and 2 are (up to constants) in the form of an IDFT, and step 4 is in the form of a
DFT. These are computable via the FFT (Section A.8.2) in $O(n \ln n)$ time. Step 3 is an
element-wise product computable in $O(n)$ time. Thus, the circular convolution can be computed with the
aid of the FFT in $O(n \ln n)$ time.
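The four steps can be sketched with NumPy's FFT routines, using `np.fft.ifft(v)` for $\overline{\mathbf{F}}\mathbf{v}/n$ and `np.fft.fft(v)` for $\mathbf{F}\mathbf{v}$. The vector length and the random inputs are arbitrary assumptions, and the dense circulant matrix is built only to check the result.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
a = rng.standard_normal(n)
c = rng.standard_normal(n)

# Circulant matrix with first column c: C[s, k] = c[(s - k) mod n]
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
C = c[idx]

z = np.fft.ifft(a)          # step 1: conj(F) a / n
lam = n * np.fft.ifft(c)    # step 2: conj(F) c, the eigenvalues of C
p = z * lam                 # step 3: element-wise product
y = np.fft.fft(p)           # step 4: F p
```

For real inputs the imaginary part of `y` is zero up to rounding, so `y.real` can be used in practice.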

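One can also efficiently compute the product of an $n \times n$ Toeplitz matrix $\mathbf{T}$ and an $n \times 1$
vector $\mathbf{a}$ in $O(n \ln n)$ time by embedding $\mathbf{T}$ into a circulant matrix $\mathbf{C}$ of size $2n \times 2n$. Namely,
define

$$\mathbf{C} = \begin{bmatrix} \mathbf{T} & \mathbf{B} \\ \mathbf{B} & \mathbf{T} \end{bmatrix},$$

where

$$\mathbf{B} = \begin{bmatrix}
0 & t_{n-1} & \cdots & t_2 & t_1 \\
t_{-(n-1)} & 0 & t_{n-1} & & t_2 \\
\vdots & t_{-(n-1)} & 0 & \ddots & \vdots \\
t_{-2} & & \ddots & \ddots & t_{n-1} \\
t_{-1} & t_{-2} & \cdots & t_{-(n-1)} & 0
\end{bmatrix}.$$

Then a product of the form $\mathbf{y} = \mathbf{T}\mathbf{a}$ can be computed in $O(n \ln n)$ time, since we may write

$$\mathbf{C} \begin{bmatrix} \mathbf{a} \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \mathbf{T} & \mathbf{B} \\ \mathbf{B} & \mathbf{T} \end{bmatrix} \begin{bmatrix} \mathbf{a} \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \mathbf{T}\mathbf{a} \\ \mathbf{B}\mathbf{a} \end{bmatrix}.$$

The left-hand side is a product of a $2n \times 2n$ circulant matrix with a vector of length $2n$, and
so can be computed in $O(n \ln n)$ time via the FFT, as previously discussed.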
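The embedding can be sketched as follows. This example assumes the convention $T_{ij} = t_{i-j}$, which is the one implied by the block $\mathbf{B}$ above; the function name `toeplitz_matvec` and its arguments `col` and `row` are illustrative choices, not notation from the text.

```python
import numpy as np

def toeplitz_matvec(col, row, a):
    """Compute T a, where T[i, j] = t_{i-j}.

    col = [t_0, t_1, ..., t_{n-1}]        (first column of T)
    row = [t_0, t_{-1}, ..., t_{-(n-1)}]  (first row of T)
    """
    n = len(a)
    # First column of the 2n x 2n circulant embedding [[T, B], [B, T]].
    c = np.concatenate([col, [0.0], row[:0:-1]])
    a_pad = np.concatenate([a, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(a_pad))  # circular convolution
    return y[:n]                                         # first n entries equal T a

rng = np.random.default_rng(0)
n = 6
col, row, a = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
row[0] = col[0]                                          # both start with t_0
i, j = np.indices((n, n))
T = np.where(i >= j, col[np.abs(i - j)], row[np.abs(i - j)])  # dense reference
y = toeplitz_matvec(col, row, a)
```

Only the first $n$ entries of the length-$2n$ result are kept; the remaining entries equal $\mathbf{B}\mathbf{a}$ and are discarded.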
Conceptually, one can also solve equations of the form $\mathbf{C}\mathbf{x} = \mathbf{b}$ for a given vector $\mathbf{b} \in \mathbb{C}^n$
and circulant matrix $\mathbf{C}$ (corresponding to $\mathbf{c} \in \mathbb{C}^n$, assuming all its eigenvalues are non-zero)
via the following four steps:

1. Compute $\mathbf{z} = \overline{\mathbf{F}}\mathbf{b}/n$.

2. Compute $\boldsymbol{\lambda} = \overline{\mathbf{F}}\mathbf{c}$.

3. Compute $\mathbf{p} = \mathbf{z}/\boldsymbol{\lambda} = [z_1/\lambda_0, \ldots, z_n/\lambda_{n-1}]^\top$.

4. Compute $\mathbf{x} = \mathbf{F}\mathbf{p}$.

Once again, Steps 1 and 2 are (up to constants) in the form of an IDFT, and Step 4 is in
the form of a DFT, all of which are computable via the FFT in $O(n \ln n)$ time, and Step 3
is computable in $O(n)$ time, meaning the solution $\mathbf{x}$ can be computed using the FFT in
$O(n \ln n)$ time.
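A minimal numerical sketch of these four steps, again assuming that $\mathbf{F}$ matches NumPy's forward FFT convention. The vector `c` below is an arbitrary illustration whose eigenvalues $4 + 2\cos(2\pi t/8)$ are all non-zero, and `circulant_solve` is a hypothetical helper name.

```python
import numpy as np

def circulant_solve(c, b):
    """Solve C x = b for the circulant matrix C generated by c."""
    n = len(b)
    z = np.fft.ifft(b)          # step 1: z = conj(F) b / n
    lam = n * np.fft.ifft(c)    # step 2: eigenvalues of C, assumed non-zero
    p = z / lam                 # step 3: element-wise division
    return np.fft.fft(p)        # step 4: x = F p

n = 8
c = np.array([4.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
b = np.arange(1.0, n + 1.0)
s, k = np.indices((n, n))
C = c[(s - k) % n]              # dense circulant matrix, used for checking only
x = circulant_solve(c, b)
```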


A.8.2 Fast Fourier Transform 


The fast Fourier transform (FFT) is a numerical algorithm for the fast evaluation of (A.40)
and (A.41). By using a divide-and-conquer strategy, the algorithm reduces the computational
complexity from $O(n^2)$ (for the naive evaluation of the linear transformation) to
$O(n \ln n)$ [60].

The essence of the algorithm lies in the following observation. Suppose $n = r_1 r_2$. Then
one can express any index $t$ appearing in (A.40) via a pair $(t_0, t_1)$, with $t = t_1 r_1 + t_0$,
where $t_0 \in \{0, 1, \ldots, r_1 - 1\}$ and $t_1 \in \{0, 1, \ldots, r_2 - 1\}$. Similarly, one can express any index
$s$ appearing in (A.40) via a pair $(s_0, s_1)$, with $s = s_1 r_2 + s_0$, where $s_0 \in \{0, 1, \ldots, r_2 - 1\}$ and
$s_1 \in \{0, 1, \ldots, r_1 - 1\}$.

Identifying $\widetilde{x}_t \equiv \widetilde{x}_{t_1, t_0}$ and $x_s \equiv x_{s_1, s_0}$, we may re-express (A.40) as

$$\widetilde{x}_{t_1, t_0} = \sum_{s_0=0}^{r_2-1} \omega^{s_0 t} \sum_{s_1=0}^{r_1-1} \omega^{s_1 r_2 t}\, x_{s_1, s_0}, \quad t_0 = 0, 1, \ldots, r_1 - 1,\; t_1 = 0, 1, \ldots, r_2 - 1. \tag{A.42}$$

Observe that $\omega^{s_1 r_2 t} = \omega^{s_1 r_2 t_0}$ (because $\omega^{r_1 r_2} = 1$), so that the inner sum over $s_1$ depends only
on $s_0$ and $t_0$. Define

$$y_{t_0, s_0} := \sum_{s_1=0}^{r_1-1} \omega^{s_1 r_2 t_0}\, x_{s_1, s_0}, \quad t_0 = 0, 1, \ldots, r_1 - 1,\; s_0 = 0, 1, \ldots, r_2 - 1.$$

Each $y_{t_0, s_0}$ is a sum of $r_1$ terms and there are $n$ of them, so computing all of the $\{y_{t_0, s_0}\}$ requires
$O(n\, r_1)$ operations. In terms of the $\{y_{t_0, s_0}\}$, (A.42) can be written as

$$\widetilde{x}_{t_1, t_0} = \sum_{s_0=0}^{r_2-1} \omega^{s_0 t}\, y_{t_0, s_0}, \quad t_1 = 0, 1, \ldots, r_2 - 1,\; t_0 = 0, 1, \ldots, r_1 - 1,$$

requiring $O(n\, r_2)$ operations to compute. Thus, calculating the DFT using this two-step
procedure requires $O(n (r_1 + r_2))$ operations, rather than $O(n^2)$.

Now supposing $n = r_1 r_2 \cdots r_m$, repeated application of the above divide-and-conquer idea
yields an $m$-step procedure requiring $O(n (r_1 + r_2 + \cdots + r_m))$ operations. In particular, if
$r_k = r$ for all $k = 1, 2, \ldots, m$, we have that $n = r^m$ and $m = \log_r n$, so that the total number
of operations is $O(r n m) = O(r n \log_r(n))$. Typically, the radix $r$ is a small (not necessarily
prime) number, for instance $r = 2$.
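For radix $r = 2$ the recursion can be sketched in a few lines. This is an illustrative recursive implementation (not an optimized one), assuming $\omega = \exp(-2\pi \mathrm{i}/n)$ and an input length that is a power of two; it corresponds to taking $r_2 = 2$, so that $s_0$ selects the even- or odd-indexed entries and the inner sums are DFTs of length $n/2$. The name `fft_radix2` is hypothetical.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])                   # inner sums for s_0 = 0
    odd = fft_radix2(x[1::2])                    # inner sums for s_0 = 1
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)   # omega^t, t < n/2
    # Outer sum over s_0; note omega^(t + n/2) = -omega^t.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

rng = np.random.default_rng(0)
x = rng.normal(size=16) + 1j * rng.normal(size=16)
x_tilde = fft_radix2(x)
```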


Further Reading 


A good reference book on matrix computations is Golub and Van Loan [52]. A useful list 
of many common vector and matrix calculus identities can be found in [95]. Strang’s in- 
troduction to linear algebra [116] is a classic textbook, and his recent book [117] combines 
linear algebra with the foundations of deep learning. Fast reliable algorithms for matrices 
with structure can be found in [64]. Kolmogorov and Fomin’s masterpiece on the theory 
of functions and functional analysis [67] still provides one of the best introductions to the 
topic. A popular choice for an advanced course in functional analysis is Rudin [106].


APPENDIX B

MULTIVARIATE DIFFERENTIATION AND OPTIMIZATION





The purpose of this appendix is to review various aspects of multivariate differen- 
tiation and optimization. We assume the reader is familiar with differentiating a real- 
valued function. 


B.1 Multivariate Differentiation 


For a multivariate function $f$ that maps a vector $\mathbf{x} = [x_1, \ldots, x_n]^\top$ to a real number $f(\mathbf{x})$,
the partial derivative with respect to $x_i$, denoted $\frac{\partial f}{\partial x_i}$, is the derivative taken with respect
to $x_i$ while all other variables are held constant. We can write all the $n$ partial derivatives
neatly using the “scalar/vector” derivative notation:

scalar/vector:

$$\frac{\partial f}{\partial \mathbf{x}} := \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}. \tag{B.1}$$

This vector of partial derivatives is known as the gradient of $f$ at $\mathbf{x}$ and is sometimes written
as $\nabla f(\mathbf{x})$.
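Next, suppose that $\mathbf{f}$ is a multivalued (vector-valued) function taking values in $\mathbb{R}^m$,
defined by

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \mapsto \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix} =: \mathbf{f}(\mathbf{x}).$$

We can compute each of the partial derivatives $\partial f_i/\partial x_j$ and organize them neatly in a
“vector/vector” derivative notation:

vector/vector:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} := \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\
\frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}. \tag{B.2}$$

The transpose of this matrix is known as the matrix of Jacobi of $\mathbf{f}$ at $\mathbf{x}$ (sometimes
called the Fréchet derivative of $\mathbf{f}$ at $\mathbf{x}$); that is,

$$\mathbf{J}_{\mathbf{f}}(\mathbf{x}) := \left[ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right]^\top = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}. \tag{B.3}$$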

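If we define $\mathbf{g}(\mathbf{x}) := \nabla f(\mathbf{x})$ and take the “vector/vector” derivative of $\mathbf{g}$ with respect to
$\mathbf{x}$, we obtain the matrix of second-order partial derivatives of $f$:

$$\mathbf{H}_f(\mathbf{x}) := \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}, \tag{B.4}$$

which is known as the Hessian matrix of $f$ at $\mathbf{x}$, also denoted as $\nabla^2 f(\mathbf{x})$. If these second-order
partial derivatives are continuous in a region around $\mathbf{x}$, then
$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$ and, hence,
the Hessian matrix $\mathbf{H}_f(\mathbf{x})$ is symmetric.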


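Finally, note that we can also define a “scalar/matrix” derivative of $y$ with respect to
$\mathbf{X} \in \mathbb{R}^{m \times n}$ with $(i, j)$-th entry $x_{ij}$:

$$\frac{\partial y}{\partial \mathbf{X}} := \begin{bmatrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1n}} \\
\frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2n}} \\
\vdots & \vdots & & \vdots \\
\frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} & \cdots & \frac{\partial y}{\partial x_{mn}}
\end{bmatrix}$$

and a “matrix/scalar” derivative:

$$\frac{\partial \mathbf{X}}{\partial y} := \begin{bmatrix}
\frac{\partial x_{11}}{\partial y} & \frac{\partial x_{12}}{\partial y} & \cdots & \frac{\partial x_{1n}}{\partial y} \\
\frac{\partial x_{21}}{\partial y} & \frac{\partial x_{22}}{\partial y} & \cdots & \frac{\partial x_{2n}}{\partial y} \\
\vdots & \vdots & & \vdots \\
\frac{\partial x_{m1}}{\partial y} & \frac{\partial x_{m2}}{\partial y} & \cdots & \frac{\partial x_{mn}}{\partial y}
\end{bmatrix}.$$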


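Example B.1 (Scalar/Matrix Derivative) Let $y = \mathbf{a}^\top \mathbf{X} \mathbf{b}$, where $\mathbf{X} \in \mathbb{R}^{m \times n}$, $\mathbf{a} \in \mathbb{R}^m$,
and $\mathbf{b} \in \mathbb{R}^n$. Since $y$ is a scalar, we can write $y = \mathrm{tr}(y) = \mathrm{tr}(\mathbf{X}\mathbf{b}\mathbf{a}^\top)$, using the cyclic property
of the trace (see Theorem A.1). Defining $\mathbf{C} := \mathbf{b}\mathbf{a}^\top$, we have

$$y = \sum_{i=1}^{m} [\mathbf{X}\mathbf{C}]_{ii} = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}\, c_{ji},$$

so that $\partial y/\partial x_{ij} = c_{ji}$ or, in matrix form,

$$\frac{\partial y}{\partial \mathbf{X}} = \mathbf{C}^\top = \mathbf{a}\mathbf{b}^\top.$$

Example B.2 (Scalar/Matrix Derivative via the Woodbury Identity) Let $y = \mathrm{tr}(\mathbf{X}^{-1}\mathbf{A})$,
where $\mathbf{X}, \mathbf{A} \in \mathbb{R}^{n \times n}$. We now prove that

$$\frac{\partial y}{\partial \mathbf{X}} = -\mathbf{X}^{-\top} \mathbf{A}^\top \mathbf{X}^{-\top}.$$

To show this, apply the Woodbury matrix identity to an infinitesimal perturbation, $\mathbf{X} + \varepsilon \mathbf{U}$,
of $\mathbf{X}$, and take $\varepsilon \downarrow 0$ to obtain the following:

$$\frac{(\mathbf{X} + \varepsilon \mathbf{U})^{-1} - \mathbf{X}^{-1}}{\varepsilon} = -\mathbf{X}^{-1}\mathbf{U}\,(\mathbf{I} + \varepsilon \mathbf{X}^{-1}\mathbf{U})^{-1}\,\mathbf{X}^{-1} \longrightarrow -\mathbf{X}^{-1}\mathbf{U}\mathbf{X}^{-1}.$$

Therefore, as $\varepsilon \downarrow 0$,

$$\frac{\mathrm{tr}((\mathbf{X} + \varepsilon \mathbf{U})^{-1}\mathbf{A}) - \mathrm{tr}(\mathbf{X}^{-1}\mathbf{A})}{\varepsilon} \longrightarrow -\mathrm{tr}(\mathbf{X}^{-1}\mathbf{U}\mathbf{X}^{-1}\mathbf{A}) = -\mathrm{tr}(\mathbf{U}\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1}).$$

Now, suppose that $\mathbf{U}$ is an all zero matrix with a one in the $(i, j)$-th position. We can write,

$$\frac{\partial y}{\partial x_{ij}} = \lim_{\varepsilon \downarrow 0} \frac{\mathrm{tr}((\mathbf{X} + \varepsilon \mathbf{U})^{-1}\mathbf{A}) - \mathrm{tr}(\mathbf{X}^{-1}\mathbf{A})}{\varepsilon} = -\mathrm{tr}(\mathbf{U}\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1}) = -\left[\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1}\right]_{ji}.$$

Therefore, $\frac{\partial y}{\partial \mathbf{X}} = -\left(\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1}\right)^\top$.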


The following two examples specify multivariate derivatives for the important special 
cases of linear and quadratic functions. 


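Example B.3 (Gradient of a Linear Function) Let $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$ for some $m \times n$ constant
matrix $\mathbf{A}$. Then, its vector/vector derivative (B.2) is the matrix

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \mathbf{A}^\top. \tag{B.5}$$

To see this, let $a_{ij}$ denote the $(i, j)$-th element of $\mathbf{A}$, so that

$$\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x} = \begin{bmatrix} \sum_{k=1}^{n} a_{1k} x_k \\ \vdots \\ \sum_{k=1}^{n} a_{mk} x_k \end{bmatrix}.$$

To find the $(j, i)$-th element of $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$, we differentiate the $i$-th element of $\mathbf{f}$ with respect to $x_j$:

$$\frac{\partial f_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{n} a_{ik} x_k = a_{ij}.$$

In other words, the $(i, j)$-th element of $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is $a_{ji}$, the $(i, j)$-th element of $\mathbf{A}^\top$.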

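Example B.4 (Gradient and Hessian of a Quadratic Function) Let $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$ for
some $n \times n$ constant matrix $\mathbf{A}$. Then,

$$\nabla f(\mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}. \tag{B.6}$$

It follows immediately that if $\mathbf{A}$ is symmetric, that is, $\mathbf{A} = \mathbf{A}^\top$, then $\nabla(\mathbf{x}^\top \mathbf{A} \mathbf{x}) = 2\mathbf{A}\mathbf{x}$ and
$\nabla^2(\mathbf{x}^\top \mathbf{A} \mathbf{x}) = 2\mathbf{A}$.

To prove (B.6), first observe that $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j$, which is a quadratic
form in $\mathbf{x}$, is real-valued, with

$$\frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j = \sum_{j=1}^{n} a_{kj} x_j + \sum_{i=1}^{n} a_{ik} x_i.$$

The first term on the right-hand side is equal to the $k$-th element of $\mathbf{A}\mathbf{x}$, whereas the second
term equals the $k$-th element of $\mathbf{x}^\top \mathbf{A}$, or equivalently the $k$-th element of $\mathbf{A}^\top \mathbf{x}$.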
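Formula (B.6) is easy to check against a finite-difference approximation of the gradient. The sketch below uses central differences with a hypothetical helper `num_grad`; the matrix `A` is deliberately non-symmetric, so that $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and $2\mathbf{A}\mathbf{x}$ differ.

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))          # not symmetric in general
x = rng.normal(size=3)
grad_numeric = num_grad(lambda v: v @ A @ v, x)
grad_formula = (A + A.T) @ x         # formula (B.6)
```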


B.1.1 Taylor Expansion 


The matrix of Jacobi and the Hessian matrix feature prominently in multidimensional 
Taylor expansions. 


Theorem B.1: Multidimensional Taylor Expansions 





The result is essentially saying that a smooth enough function behaves locally (in the 
neighborhood of a point x) like a linear and quadratic function. Thus, the gradient or Hes- 
sian of an approximating linear or quadratic function is a basic building block of many 
approximation and optimization algorithms. 


Remark B.1 (Version Without Remainder Terms) An alternative version of Taylor’s
theorem states that there exists an $\mathbf{a}'$ that lies on the line segment between $\mathbf{x}$ and $\mathbf{a}$ such
that (B.7) and (B.8) hold without remainder terms, with $\mathbf{J}_f(\mathbf{a})$ in (B.7) replaced by $\mathbf{J}_f(\mathbf{a}')$
and $\mathbf{H}_f(\mathbf{a})$ in (B.8) replaced by $\mathbf{H}_f(\mathbf{a}')$.


B.1.2 Chain Rule 


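Consider the functions $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ and $\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^k$. The function $\mathbf{x} \mapsto \mathbf{g}(\mathbf{f}(\mathbf{x}))$ is called
the composition of $\mathbf{g}$ and $\mathbf{f}$, written as $\mathbf{g} \circ \mathbf{f}$, and is a function from $\mathbb{R}^n$ to $\mathbb{R}^k$. Suppose
$\mathbf{y} = \mathbf{f}(\mathbf{x})$ and $\mathbf{z} = \mathbf{g}(\mathbf{y})$, as in Figure B.1. Let $\mathbf{J}_{\mathbf{f}}(\mathbf{x})$ and $\mathbf{J}_{\mathbf{g}}(\mathbf{y})$ be the (Fréchet) derivatives
of $\mathbf{f}$ (at $\mathbf{x}$) and $\mathbf{g}$ (at $\mathbf{y}$), respectively. We may think of $\mathbf{J}_{\mathbf{f}}(\mathbf{x})$ as the matrix that describes
how, in a neighborhood of $\mathbf{x}$, the function $\mathbf{f}$ can be approximated by a linear function:
$\mathbf{f}(\mathbf{x} + \mathbf{h}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}_{\mathbf{f}}(\mathbf{x})\mathbf{h}$, and similarly for $\mathbf{J}_{\mathbf{g}}(\mathbf{y})$. The well-known chain rule of calculus
simply states that the derivative of the composition $\mathbf{g} \circ \mathbf{f}$ is the matrix product of the
derivatives of $\mathbf{g}$ and $\mathbf{f}$; that is,

$$\mathbf{J}_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{y})\, \mathbf{J}_{\mathbf{f}}(\mathbf{x}).$$

[Figure B.1: Function composition. The blue arrows symbolize the linear mappings.]

In terms of our vector/vector derivative notation, we have

$$\left[\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right]^\top = \left[\frac{\partial \mathbf{z}}{\partial \mathbf{y}}\right]^\top \left[\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right]^\top$$

or, more simply,

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \frac{\partial \mathbf{z}}{\partial \mathbf{y}}. \tag{B.9}$$

In a similar way we can establish a scalar/matrix chain rule. In particular, suppose $\mathbf{X}$ is
an $n \times p$ matrix, which is mapped to $\mathbf{y} := \mathbf{X}\mathbf{a}$ for a fixed $p$-dimensional vector $\mathbf{a}$. In turn, $\mathbf{y}$
is mapped to a scalar $z := g(\mathbf{y})$ for some function $g$. Denote the columns of $\mathbf{X}$ by $\mathbf{x}_1, \ldots, \mathbf{x}_p$.
Then,

$$\mathbf{y} = \mathbf{X}\mathbf{a} = \sum_{j=1}^{p} a_j \mathbf{x}_j$$

and, therefore, $\partial \mathbf{y}/\partial \mathbf{x}_j = a_j \mathbf{I}_n$. It follows by the chain rule (B.9) that

$$\frac{\partial z}{\partial \mathbf{x}_i} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}_i} \frac{\partial z}{\partial \mathbf{y}} = a_i \mathbf{I}_n \frac{\partial z}{\partial \mathbf{y}} = a_i \frac{\partial z}{\partial \mathbf{y}}.$$

Therefore,

$$\frac{\partial z}{\partial \mathbf{X}} = \left[\frac{\partial z}{\partial \mathbf{x}_1}, \ldots, \frac{\partial z}{\partial \mathbf{x}_p}\right] = \left[a_1 \frac{\partial z}{\partial \mathbf{y}}, \ldots, a_p \frac{\partial z}{\partial \mathbf{y}}\right] = \frac{\partial z}{\partial \mathbf{y}}\, \mathbf{a}^\top. \tag{B.10}$$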


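Example B.5 (Derivative of the Log-Determinant) Suppose we are given a positive
definite matrix $\mathbf{A} \in \mathbb{R}^{p \times p}$ and wish to compute the scalar/matrix derivative $\frac{\partial \ln|\mathbf{A}|}{\partial \mathbf{A}}$. The result
is

$$\frac{\partial \ln|\mathbf{A}|}{\partial \mathbf{A}} = \mathbf{A}^{-1}.$$

To see this, we can reason as follows. By Theorem A.7, we can write $\mathbf{A} = \mathbf{Q}\mathbf{D}\mathbf{Q}^\top$, where
$\mathbf{Q}$ is an orthogonal matrix and $\mathbf{D} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of eigenvalues of
$\mathbf{A}$. The eigenvalues are strictly positive, since $\mathbf{A}$ is positive definite. Denoting the columns
of $\mathbf{Q}$ by $(\mathbf{q}_i)$, we have

$$\lambda_i = \mathbf{q}_i^\top \mathbf{A} \mathbf{q}_i = \mathrm{tr}(\mathbf{q}_i^\top \mathbf{A} \mathbf{q}_i), \quad i = 1, \ldots, p. \tag{B.11}$$

From the properties of determinants, we have $y := \ln|\mathbf{A}| = \ln|\mathbf{Q}\mathbf{D}\mathbf{Q}^\top| = \ln(|\mathbf{Q}|\,|\mathbf{D}|\,|\mathbf{Q}^\top|) =
\ln|\mathbf{D}| = \sum_{i=1}^{p} \ln \lambda_i$. We can thus write

$$\frac{\partial \ln|\mathbf{A}|}{\partial \mathbf{A}} = \sum_{i=1}^{p} \frac{\partial \ln \lambda_i}{\partial \mathbf{A}} = \sum_{i=1}^{p} \frac{\partial \lambda_i}{\partial \mathbf{A}} \frac{\partial \ln \lambda_i}{\partial \lambda_i} = \sum_{i=1}^{p} \frac{\partial \lambda_i}{\partial \mathbf{A}} \frac{1}{\lambda_i},$$

where the second equation follows from the chain rule applied to the function composition
$\mathbf{A} \mapsto \lambda_i \mapsto y$. From (B.11) and Example B.1 we have $\partial \lambda_i/\partial \mathbf{A} = \mathbf{q}_i \mathbf{q}_i^\top$. It follows that

$$\frac{\partial y}{\partial \mathbf{A}} = \sum_{i=1}^{p} \mathbf{q}_i \frac{1}{\lambda_i} \mathbf{q}_i^\top = \mathbf{Q}\mathbf{D}^{-1}\mathbf{Q}^\top = \mathbf{A}^{-1}.$$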
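The result can be checked numerically by perturbing one entry of $\mathbf{A}$ at a time. The following is a small sketch using central differences and `numpy.linalg.slogdet`; the test matrix, the seed, and the helper name `logdet` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
M = rng.normal(size=(p, p))
A = M @ M.T + p * np.eye(p)          # symmetric positive definite test matrix

def logdet(B):
    """ln|B| for a matrix with positive determinant."""
    sign, value = np.linalg.slogdet(B)
    return value

eps = 1e-6
numeric = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps                # perturb the (i, j)-th entry only
        numeric[i, j] = (logdet(A + E) - logdet(A - E)) / (2 * eps)

analytic = np.linalg.inv(A)          # claimed derivative A^{-1}
```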


B.2 Optimization Theory 


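Optimization is concerned with finding minimal or maximal solutions of a real-valued
objective function $f$ in some set $\mathscr{X}$:

$$\min_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}) \quad \text{or} \quad \max_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}). \tag{B.12}$$

Since any maximization problem can easily be converted into a minimization problem via
the equivalence $\max_{\mathbf{x}} f(\mathbf{x}) \equiv -\min_{\mathbf{x}} -f(\mathbf{x})$, we focus only on minimization problems. We
use the following terminology. A local minimizer of $f(\mathbf{x})$ is an element $\mathbf{x}^* \in \mathscr{X}$ such that
$f(\mathbf{x}^*) \leq f(\mathbf{x})$ for all $\mathbf{x}$ in some neighborhood of $\mathbf{x}^*$. If $f(\mathbf{x}^*) \leq f(\mathbf{x})$ for all $\mathbf{x} \in \mathscr{X}$, then $\mathbf{x}^*$ is
called a global minimizer or global solution. The set of global minimizers is denoted by

$$\mathop{\mathrm{argmin}}_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}).$$

The function value $f(\mathbf{x}^*)$ corresponding to a local/global minimizer $\mathbf{x}^*$ is referred to as the
local/global minimum of $f(\mathbf{x})$.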

Optimization problems may be classified by the set X and the objective function f. 
If X is countable, the optimization problem is called discrete or combinatorial. If instead 
X is a nondenumerable set such as R” and f takes values in a nondenumerable set, then 
the problem is said to be continuous. Optimization problems that are neither discrete nor 
continuous are said to be mixed. 

The search set $\mathscr{X}$ is often defined by means of constraints. A standard setting for
constrained optimization (minimization) is the following:

$$\begin{aligned}
\min_{\mathbf{x} \in \mathscr{Y}} \quad & f(\mathbf{x}) \\
\text{subject to:} \quad & h_i(\mathbf{x}) = 0, \quad i = 1, \ldots, m, \\
& g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, k.
\end{aligned} \tag{B.13}$$

Here, $f$ is the objective function, and $\{g_i\}$ and $\{h_i\}$ are given functions so that $h_i(\mathbf{x}) = 0$
and $g_i(\mathbf{x}) \leq 0$ represent the equality and inequality constraints, respectively. The region
$\mathscr{X} \subseteq \mathscr{Y}$ where the objective function is defined and where all the constraints are satisfied
is called the feasible region. An optimization problem without constraints is said to be an
unconstrained problem.

For an unconstrained continuous optimization problem, the search space $\mathscr{X}$ is often
taken to be (a subset of) $\mathbb{R}^n$, and $f$ is assumed to be a $C^k$ function for sufficiently high
$k$ (typically $k = 2$ or $3$ suffices); that is, its $k$-th order derivative is continuous. For a $C^1$
function the standard approach to minimizing $f(\mathbf{x})$ is to solve the equation

$$\nabla f(\mathbf{x}) = \mathbf{0}, \tag{B.14}$$

where $\nabla f(\mathbf{x})$ is the gradient of $f$ at $\mathbf{x}$. The solutions $\mathbf{x}^*$ to (B.14) are called stationary
points. Stationary points can be local/global minimizers, local/global maximizers, or
saddle points (which are neither). If, in addition, the function is $C^2$, the condition

$$\mathbf{y}^\top (\nabla^2 f(\mathbf{x}^*))\, \mathbf{y} > 0 \quad \text{for all } \mathbf{y} \neq \mathbf{0} \tag{B.15}$$

ensures that the stationary point $\mathbf{x}^*$ is a local minimizer of $f$. The condition (B.15) states
that the Hessian matrix of $f$ at $\mathbf{x}^*$ is positive definite. Recall that we write $\mathbf{H} \succ 0$ to indicate
that a matrix $\mathbf{H}$ is positive definite.
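Condition (B.15) can be tested through the eigenvalues of the Hessian. The sketch below applies this second-order test to the illustrative function $f(x, y) = x^3 - 3x + y^2$, which is not taken from the text. The labels for maximizers and saddle points rely on the standard mirror-image facts (negative definite or indefinite Hessian), and the helper name `classify` is hypothetical.

```python
import numpy as np

def classify(hessian):
    """Classify a stationary point from the eigenvalues of its symmetric Hessian."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > 0):
        return "local minimizer"     # condition (B.15) holds
    if np.all(eig < 0):
        return "local maximizer"
    if eig.min() < 0 < eig.max():
        return "saddle point"
    return "inconclusive"            # a zero eigenvalue: the test does not decide

# f(x, y) = x^3 - 3x + y^2 has gradient (3x^2 - 3, 2y),
# so its stationary points are (1, 0) and (-1, 0).
def hess(x, y):
    return np.array([[6.0 * x, 0.0], [0.0, 2.0]])

labels = [classify(hess(1.0, 0.0)), classify(hess(-1.0, 0.0))]
```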

In Figure B.2 we have a multiextremal objective function on X = R. There are four 
stationary points: two are local minimizers, one is a local maximizer, and one is neither a 
minimizer nor a maximizer, but a saddle point. 






[Figure B.2: A multiextremal objective function in one dimension. Labeled points: local maximum, saddle point, local minimum, global minimum.]


B.2.1 Convexity and Optimization 


An important class of optimization problems is related to the notion of convexity. A set $\mathscr{X}$
is said to be convex if for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathscr{X}$ it holds that $\alpha\, \mathbf{x}_1 + (1 - \alpha)\, \mathbf{x}_2 \in \mathscr{X}$ for all $0 \leq \alpha \leq 1$.

In addition, the objective function $f$ is a convex function provided that for each $\mathbf{x}$ in the
interior of $\mathscr{X}$ there exists a vector $\mathbf{v}$ such that

$$f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^\top \mathbf{v}, \quad \mathbf{y} \in \mathscr{X}. \tag{B.16}$$

The vector $\mathbf{v}$ in (B.16) may not be unique and is referred to as a subgradient of $f$.
One of the crucial properties of a convex function $f$ is that Jensen’s inequality holds
(see Exercise 14 in Chapter 2):

$$\mathbb{E} f(\mathbf{X}) \geq f(\mathbb{E}\mathbf{X}),$$

for any random vector $\mathbf{X}$.
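Example B.6 (Convexity and Directional Derivative) The directional derivative of a
multivariate function $f$ at $\mathbf{x}$ in the direction $\mathbf{d}$ is defined as the right derivative of $g(t) :=
f(\mathbf{x} + t\,\mathbf{d})$ at $t = 0$:

$$\lim_{t \downarrow 0} \frac{f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x})}{t} = \lim_{t \uparrow \infty} t\, \big(f(\mathbf{x} + \mathbf{d}/t) - f(\mathbf{x})\big).$$

This right derivative may not always exist. However, if $f$ is a convex function, then the
directional derivative of $f$ at $\mathbf{x}$ in the interior of its domain always exists (in any direction
$\mathbf{d}$).

To see this, let $t_1 \geq t_2 > 0$. By Jensen’s inequality we have for any $\mathbf{x}$ and $\mathbf{y}$ in the interior
of the domain:

$$\frac{t_2}{t_1} f(\mathbf{y}) + \left(1 - \frac{t_2}{t_1}\right) f(\mathbf{x}) \geq f\left(\frac{t_2}{t_1} \mathbf{y} + \left(1 - \frac{t_2}{t_1}\right) \mathbf{x}\right).$$

Making the substitution $\mathbf{y} = \mathbf{x} + t_1 \mathbf{d}$ and rearranging the last equation yields:

$$\frac{f(\mathbf{x} + t_1 \mathbf{d}) - f(\mathbf{x})}{t_1} \geq \frac{f(\mathbf{x} + t_2 \mathbf{d}) - f(\mathbf{x})}{t_2}.$$

In other words, the function $t \mapsto (f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x}))/t$ is increasing for $t > 0$ and therefore
the directional derivative satisfies:

$$\lim_{t \downarrow 0} \frac{f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x})}{t} = \inf_{t > 0} \frac{f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x})}{t}.$$

Hence, to show existence it is enough to show that $(f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x}))/t$ is bounded from
below.

Since $\mathbf{x}$ lies in the interior of the domain of $f$, we can choose $t$ small enough so that
$\mathbf{x} + t\,\mathbf{d}$ also lies in the interior. Therefore, the convexity of $f$ implies that there exists a
subgradient vector $\mathbf{v}$ such that $f(\mathbf{x} + t\,\mathbf{d}) \geq f(\mathbf{x}) + \mathbf{v}^\top (t\,\mathbf{d})$. In other words,

$$\frac{f(\mathbf{x} + t\,\mathbf{d}) - f(\mathbf{x})}{t} \geq \mathbf{v}^\top \mathbf{d}$$

provides a lower bound for all $t > 0$, and the directional derivative of $f$ at an interior $\mathbf{x}$
always exists (in any direction).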


A function $f$ satisfying (B.16) with strict inequality is said to be strictly convex. It is
said to be a (strictly) concave function if $-f$ is (strictly) convex. Assuming that $\mathscr{X}$ is an
open set, convexity for $f \in C^1$ is equivalent to

$$f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^\top \nabla f(\mathbf{x}) \quad \text{for all } \mathbf{x}, \mathbf{y} \in \mathscr{X}.$$

Moreover, for $f \in C^2$ strict convexity is implied by the Hessian matrix being positive
definite for all $\mathbf{x} \in \mathscr{X}$ (the converse can fail: $f(x) = x^4$ is strictly convex although $f''(0) = 0$),
and convexity is equivalent to the Hessian matrix being positive
semidefinite for all $\mathbf{x}$; that is, $\mathbf{y}^\top (\nabla^2 f(\mathbf{x}))\, \mathbf{y} \geq 0$ for all $\mathbf{y}$ and $\mathbf{x}$. Recall that we write $\mathbf{H} \succeq 0$
to indicate that a matrix $\mathbf{H}$ is positive semidefinite.

Example B.7 (Convexity and Differentiability) If $f$ is a continuously differentiable
multivariate function, then $f$ is convex if and only if the univariate function

$$g(t) := f(\mathbf{x} + t\,\mathbf{d}), \quad t \in [0, 1]$$

is a convex function for any $\mathbf{x}$ and $\mathbf{x} + \mathbf{d}$ in the interior of the domain of $f$. This property
provides an alternative definition for convexity of a multivariate and differentiable function.

To see why it is true, first assume that $f$ is convex and $t_1, t_2 \in [0, 1]$. Then, using
the subgradient definition of convexity in (B.16) at the point $\mathbf{x} + t_2 \mathbf{d}$, we have
$f(\mathbf{x} + t_1 \mathbf{d}) \geq f(\mathbf{x} + t_2 \mathbf{d}) + (t_1 - t_2)\, \mathbf{v}^\top \mathbf{d}$ for some subgradient $\mathbf{v}$ of $f$ at $\mathbf{x} + t_2 \mathbf{d}$. In other words,

$$g(t_1) \geq g(t_2) + (t_1 - t_2)\, \mathbf{v}^\top \mathbf{d}$$

for any two points $t_1, t_2 \in [0, 1]$. Therefore, $g$ is convex, because we have identified the
existence of a subgradient $\mathbf{v}^\top \mathbf{d}$ for each $t_2$.

Conversely, assume that $g$ is convex for $t \in [0, 1]$. Since $f$ is differentiable, then so is $g$.
The convexity of $g$ implies that the difference quotient $(g(t) - g(0))/t$ is increasing in
$t \in (0, 1]$ (by the same argument as in Example B.6), so that

$$\frac{g(t) - g(0)}{t} \geq \lim_{s \downarrow 0} \frac{g(s) - g(0)}{s} = g'(0) = \mathbf{d}^\top \nabla f(\mathbf{x}).$$

Therefore,

$$g(t) \geq g(0) + t\, \mathbf{d}^\top \nabla f(\mathbf{x}),$$

and substituting $t = 1$ yields:

$$f(\mathbf{x} + \mathbf{d}) \geq f(\mathbf{x}) + \mathbf{d}^\top \nabla f(\mathbf{x}),$$

so that there exists a subgradient vector, namely $\nabla f(\mathbf{x})$, for each $\mathbf{x}$. Hence, $f$ is convex by
the definition in (B.16).


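An optimization program of the form (B.13) is said to be a convex programming problem
if:

1. The objective $f$ is a convex function.

2. The inequality constraint functions $\{g_i\}$ are convex.

3. The equality constraint functions $\{h_i\}$ are affine, that is, of the form $\mathbf{a}_i^\top \mathbf{x} - b_i$. This is
equivalent to both $h_i$ and $-h_i$ being convex for all $i$.

Table B.1 summarizes some commonly encountered problems, all of which are convex,
with the exception of the quadratic programs with $\mathbf{A} \not\succeq 0$.

Table B.1: Some common classes of optimization problems.

Name                    | $f(\mathbf{x})$                                                               | Constraints
Linear Program (LP)     | $\mathbf{c}^\top \mathbf{x}$                                                  | $\mathbf{A}\mathbf{x} = \mathbf{b}$ and $\mathbf{x} \geq \mathbf{0}$
Inequality Form LP      | $\mathbf{c}^\top \mathbf{x}$                                                  | $\mathbf{A}\mathbf{x} \leq \mathbf{b}$
Quadratic Program (QP)  | $\frac{1}{2}\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top \mathbf{x}$ | $\mathbf{D}\mathbf{x} \leq \mathbf{d}$, $\mathbf{E}\mathbf{x} = \mathbf{e}$
Convex QP               | $\frac{1}{2}\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top \mathbf{x}$ | $\mathbf{D}\mathbf{x} \leq \mathbf{d}$, $\mathbf{E}\mathbf{x} = \mathbf{e}$ ($\mathbf{A} \succeq 0$)
Convex Program          | $f(\mathbf{x})$ convex                                                        | $\{g_i(\mathbf{x})\}$ convex, $\{h_i(\mathbf{x})\}$ of the form $\mathbf{a}_i^\top \mathbf{x} - b_i$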

Recognizing convex optimization problems or those that can be transformed to convex 
optimization problems can be challenging. However, once formulated as convex optimiz- 
ation problems, these can be efficiently solved using subgradient [112], bundle [57], and 
cutting-plane methods [59]. 


B.2.2 Lagrangian Method 


The main components of the Lagrangian method are the Lagrange multipliers and the 
Lagrange function. The method was developed by Lagrange in 1797 for the optimization 
problem (B.13) with only equality constraints. In 1951 Kuhn and Tucker extended Lag- 
range’s method to inequality constraints. Given an optimization problem (B.13) containing 
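only equality constraints $h_i(\mathbf{x}) = 0$, $i = 1, \ldots, m$, the Lagrange function is defined as

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^{m} \beta_i\, h_i(\mathbf{x}),$$

where the coefficients $\{\beta_i\}$ are called the Lagrange multipliers. A necessary condition for a
point $\mathbf{x}^*$ to be a local minimizer of $f(\mathbf{x})$ subject to the equality constraints $h_i(\mathbf{x}) = 0$, $i =
1, \ldots, m$, is

$$\nabla_{\mathbf{x}}\, \mathcal{L}(\mathbf{x}^*, \boldsymbol{\beta}^*) = \mathbf{0}, \qquad \nabla_{\boldsymbol{\beta}}\, \mathcal{L}(\mathbf{x}^*, \boldsymbol{\beta}^*) = \mathbf{0},$$

for some value $\boldsymbol{\beta}^*$. The above conditions are also sufficient if $\mathcal{L}(\mathbf{x}, \boldsymbol{\beta}^*)$ is a convex function
of $\mathbf{x}$.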
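As a small worked illustration (the problem itself is an assumption chosen for simplicity, not one from the text), consider minimizing $f(\mathbf{x}) = x_1^2 + x_2^2$ subject to $h(\mathbf{x}) = x_1 + x_2 - 1 = 0$. The two gradient conditions on the Lagrange function are linear here and can be solved as one system:

```python
import numpy as np

# Minimize f(x) = x1^2 + x2^2 subject to h(x) = x1 + x2 - 1 = 0.
# Lagrange function: L(x, beta) = x'Qx/2 + beta * (E x - e), with Q = 2I, E = [1, 1], e = 1.
Q = 2.0 * np.eye(2)
E = np.array([[1.0, 1.0]])
e = np.array([1.0])

# grad_x L = Q x + E' beta = 0 and grad_beta L = E x - e = 0 as a single linear system.
K = np.block([[Q, E.T], [E, np.zeros((1, 1))]])
rhs = np.concatenate([np.zeros(2), e])
sol = np.linalg.solve(K, rhs)
x_star, beta_star = sol[:2], sol[2]
```

Since the Lagrange function is convex in $\mathbf{x}$ for this problem, the stationary point $\mathbf{x}^* = (1/2, 1/2)$ with $\beta^* = -1$ is the constrained minimizer.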


Given the original optimization problem (B.13), containing both the equality and
inequality constraints, the generalized Lagrange function, or Lagrangian, is defined as

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^{k} \alpha_i\, g_i(\mathbf{x}) + \sum_{i=1}^{m} \beta_i\, h_i(\mathbf{x}).$$

Theorem B.2: Karush–Kuhn–Tucker (KKT) Conditions





For convex programs we have the following important results [18, 43]: 


1. Every local solution x* to a convex programming problem is a global solution and 
the set of global solutions is convex. If, in addition, the objective function is strictly 
convex, then any global solution is unique. 


2. For a strictly convex programming problem with $C^1$ objective and constraint functions,
the KKT conditions are necessary and sufficient for a unique global solution.


B.2.3 Duality 


The aim of duality is to provide an alternative formulation of an optimization problem 
which is often more computationally efficient or has some theoretical significance (see [43, 
Page 219]). The original problem (B.13) is referred to as the primal problem whereas the 
reformulated problem, based on Lagrange multipliers, is called the dual problem. Duality 
theory is most relevant to convex optimization problems. It is well known that if the primal 
optimization problem is (strictly) convex then the dual problem is (strictly) concave and 
has a (unique) solution from which the (unique) optimal primal solution can be deduced. 

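The Lagrange dual program (also called the Wolfe dual) of the primal program (B.13)
is:

$$\begin{aligned}
\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \quad & \mathcal{L}^*(\boldsymbol{\alpha}, \boldsymbol{\beta}) \\
\text{subject to:} \quad & \boldsymbol{\alpha} \geq \mathbf{0},
\end{aligned}$$

where $\mathcal{L}^*$ is the Lagrange dual function:

$$\mathcal{L}^*(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x} \in \mathscr{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}), \tag{B.17}$$

giving the greatest lower bound (infimum) of $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$ over all possible $\mathbf{x} \in \mathscr{X}$.

It is not difficult to see that if $f^*$ is the minimal value of the primal problem, then
$\mathcal{L}^*(\boldsymbol{\alpha}, \boldsymbol{\beta}) \leq f^*$ for any $\boldsymbol{\alpha} \geq \mathbf{0}$ and any $\boldsymbol{\beta}$. This property is called weak duality. The
Lagrangian dual program thus determines the best lower bound on $f^*$. If $d^*$ is the optimal
value for the dual problem then $d^* \leq f^*$. The difference $f^* - d^*$ is called the duality gap.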

The duality gap is extremely useful for providing lower bounds for the solutions of
primal problems that may be impossible to solve directly. It is important to note that for
linearly constrained problems, if the primal is infeasible (does not have a solution satisfying
the constraints), then the dual is either infeasible or unbounded. Conversely, if the dual
is infeasible then the primal has no solution. Of crucial importance is the strong duality
theorem, which states that for convex programs (B.13) with linear constraint functions $h_i$
and $g_i$ the duality gap is zero, and any $\mathbf{x}^*$ and $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$ satisfying the KKT conditions are
(global) solutions to the primal and dual programs, respectively. In particular, this holds for
linear and convex quadratic programs (note that not all quadratic programs are convex).

For a convex primal program with $C^1$ objective and constraint functions, the Lagrangian
dual function (B.17) can be obtained by simply setting the gradient (with respect to $\mathbf{x}$) of
the Lagrangian $\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$ to zero. One can further simplify the dual program by substituting
into the Lagrangian the relations between the variables thus obtained.

Further, for a convex primal problem, if there is a strictly feasible point $\mathbf{x}$ (that is, a
feasible point satisfying all of the inequality constraints with strict inequality), then the
duality gap is zero, and strong duality holds. This is known as Slater’s condition [18, Page
226].
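A one-dimensional illustration (chosen here for simplicity; it is not from the text): minimize $f(x) = x^2$ subject to $g(x) = 1 - x \leq 0$, whose primal optimum is $f^* = 1$ at $x = 1$. Setting the derivative of $\mathcal{L}(x, \alpha) = x^2 + \alpha(1 - x)$ with respect to $x$ to zero gives $x = \alpha/2$ and hence $\mathcal{L}^*(\alpha) = \alpha - \alpha^2/4$. The sketch below evaluates this dual function on a grid to show weak duality and, since $x = 2$ is strictly feasible, a zero duality gap.

```python
import numpy as np

f_star = 1.0                                   # primal optimum, attained at x = 1

def dual(alpha):
    """L*(alpha) = inf_x [x^2 + alpha (1 - x)], attained at x = alpha / 2."""
    return alpha - alpha**2 / 4.0

alphas = np.linspace(0.0, 5.0, 501)            # grid over the dual feasible set alpha >= 0
values = dual(alphas)
d_star = values.max()                          # best lower bound found on the grid
alpha_star = alphas[values.argmax()]
```

Every value of the dual function lies below $f^*$, and the maximum $d^* = 1$ at $\alpha^* = 2$ matches the primal optimum.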

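The Lagrange dual problem is an important example of a saddle-point problem or minimax
problem. In such problems the aim is to locate a point $(\mathbf{x}^*, \mathbf{y}^*) \in \mathscr{X} \times \mathscr{Y}$ that satisfies

$$\sup_{\mathbf{y} \in \mathscr{Y}} \inf_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}, \mathbf{y}) = \inf_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}, \mathbf{y}^*) = f(\mathbf{x}^*, \mathbf{y}^*) = \sup_{\mathbf{y} \in \mathscr{Y}} f(\mathbf{x}^*, \mathbf{y}) = \inf_{\mathbf{x} \in \mathscr{X}} \sup_{\mathbf{y} \in \mathscr{Y}} f(\mathbf{x}, \mathbf{y}).$$

The equation

$$\sup_{\mathbf{y} \in \mathscr{Y}} \inf_{\mathbf{x} \in \mathscr{X}} f(\mathbf{x}, \mathbf{y}) = \inf_{\mathbf{x} \in \mathscr{X}} \sup_{\mathbf{y} \in \mathscr{Y}} f(\mathbf{x}, \mathbf{y})$$

is known as the minimax equality. Other problems that fall into this framework are zero-sum
games in game theory; see also [24] for a number of combinatorial optimization problems
that can be viewed as minimax problems.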


B.3 Numerical Root-Finding and Minimization 


In order to minimize a $C^1$ function $f : \mathbb{R}^n \to \mathbb{R}$ one may solve

$$\nabla f(\mathbf{x}) = \mathbf{0},$$

which gives a stationary point of $f$. As a consequence, any technique for root-finding can
be transformed into an unconstrained optimization method by attempting to locate roots
of the gradient. However, as noted in Section B.2, not all stationary points are minima,
and so additional information (such as is contained in the Hessian, if $f$ is $C^2$) needs to be
considered in order to establish the type of stationary point.

Alternatively, a root of a continuous function $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^n$ may be found by minimizing
the norm of $\mathbf{g}(\mathbf{x})$ over all $\mathbf{x}$; that is, by solving $\min_{\mathbf{x}} f(\mathbf{x})$, with $f(\mathbf{x}) := \|\mathbf{g}(\mathbf{x})\|_p$, where for
$p \geq 1$ the $p$-norm of $\mathbf{y} = [y_1, \ldots, y_n]^\top$ is defined as

$$\|\mathbf{y}\|_p := \left(\sum_{i=1}^{n} |y_i|^p\right)^{1/p}.$$


Hence, any (un)constrained optimization method can be transformed into a technique for
locating the roots of a function.

Starting with an initial guess $\mathbf{x}_0$, most minimization and root-finding algorithms create
a sequence $\mathbf{x}_0, \mathbf{x}_1, \ldots$ using the iterative updating rule:

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \alpha_t\, \mathbf{d}_t, \quad t = 0, 1, 2, \ldots, \tag{B.18}$$

where $\alpha_t > 0$ is a (typically small) step size, called the learning rate, and the vector $\mathbf{d}_t$
is the search direction at step $t$. The iteration (B.18) continues until the sequence $\{\mathbf{x}_t\}$ is
deemed to have converged to a solution, or a computational budget has been exhausted.
The performance of all such iterative methods depends crucially on the quality of the initial
guess $\mathbf{x}_0$.

There are two broad categories of iterative optimization algorithms of the form (B.18):

• Those of line search type, where at iteration $t$ we first compute a direction $\mathbf{d}_t$ and
then determine a reasonable step size $\alpha_t$ along this direction. For example, in the
case of minimization, $\alpha_t > 0$ may be chosen to approximately minimize $f(\mathbf{x}_t + \alpha\, \mathbf{d}_t)$
for fixed $\mathbf{x}_t$ and $\mathbf{d}_t$.

• Those of trust region type, where at each iteration $t$ we first determine a suitable step
size $\alpha_t$ and then compute an approximately optimal direction $\mathbf{d}_t$.

In the following sections, we review several widely-used root-finding and optimization
algorithms of the line search type.


B.3.1 Newton-Like Methods 


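Suppose we wish to find roots of a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^n$. If $\mathbf{f}$ is in $C^1$, we can approximate
$\mathbf{f}$ around a point $\mathbf{x}_t$ as

$$\mathbf{f}(\mathbf{x}) \approx \mathbf{f}(\mathbf{x}_t) + \mathbf{J}_{\mathbf{f}}(\mathbf{x}_t)(\mathbf{x} - \mathbf{x}_t),$$

where $\mathbf{J}_{\mathbf{f}}$ is the matrix of Jacobi, that is, the matrix of partial derivatives of $\mathbf{f}$; see (B.3). When
$\mathbf{J}_{\mathbf{f}}(\mathbf{x}_t)$ is invertible, this linear approximation has root $\mathbf{x}_t - \mathbf{J}_{\mathbf{f}}^{-1}(\mathbf{x}_t)\, \mathbf{f}(\mathbf{x}_t)$. This gives the
iterative updating formula (B.18) for finding roots of $\mathbf{f}$ with direction $\mathbf{d}_t = -\mathbf{J}_{\mathbf{f}}^{-1}(\mathbf{x}_t)\, \mathbf{f}(\mathbf{x}_t)$
and learning rate $\alpha_t = 1$. This is known as Newton’s method (or the Newton–Raphson
method) for root-finding.

Instead of a unit learning rate, sometimes it is more effective to use an $\alpha_t$ that satisfies
the Armijo inexact line search condition:

$$\|\mathbf{f}(\mathbf{x}_t + \alpha_t\, \mathbf{d}_t)\| < (1 - \varepsilon_1 \alpha_t)\, \|\mathbf{f}(\mathbf{x}_t)\|,$$

where $\varepsilon_1$ is a small heuristically chosen constant, say $\varepsilon_1 = 10^{-4}$. For $C^1$ functions, such an
$\alpha_t$ always exists by continuity and can be computed as in the following algorithm.

Algorithm B.3.1: Newton–Raphson for Finding Roots of $\mathbf{f}(\mathbf{x}) = \mathbf{0}$
input: An initial guess $\mathbf{x}$ and stopping error $\varepsilon > 0$.
output: The approximate root of $\mathbf{f}(\mathbf{x}) = \mathbf{0}$.
while $\|\mathbf{f}(\mathbf{x})\| > \varepsilon$ and budget is not exhausted do
    Solve the linear system $\mathbf{J}_{\mathbf{f}}(\mathbf{x})\, \mathbf{d} = -\mathbf{f}(\mathbf{x})$.
    $\alpha \leftarrow 1$
    while $\|\mathbf{f}(\mathbf{x} + \alpha \mathbf{d})\| > (1 - 10^{-4} \alpha)\, \|\mathbf{f}(\mathbf{x})\|$ do
        $\alpha \leftarrow \alpha/2$
    $\mathbf{x} \leftarrow \mathbf{x} + \alpha\, \mathbf{d}$
return $\mathbf{x}$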
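Algorithm B.3.1 can be sketched in Python as follows. The function name `newton_root`, the iteration cap `max_iter` (standing in for the computational budget), and the small floor on $\alpha$ are illustrative additions; the test system, a circle of radius 2 intersected with the line $x_1 = x_2$, is likewise an assumption made for the example.

```python
import numpy as np

def newton_root(f, jac, x0, eps=1e-10, max_iter=100):
    """Newton-Raphson root-finding with the halving line search of Algorithm B.3.1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):                      # max_iter plays the role of the budget
        fx = f(x)
        norm_fx = np.linalg.norm(fx)
        if norm_fx <= eps:
            break
        d = np.linalg.solve(jac(x), -fx)           # solve J_f(x) d = -f(x)
        alpha = 1.0
        while (np.linalg.norm(f(x + alpha * d)) > (1 - 1e-4 * alpha) * norm_fx
               and alpha > 1e-12):                 # floor on alpha: an added safeguard
            alpha /= 2
        x = x + alpha * d
    return x

# Example system: x1^2 + x2^2 = 4 and x1 = x2, with a root at (sqrt(2), sqrt(2)).
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])
jac = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
root = newton_root(f, jac, x0=[1.0, 2.0])
```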


We can adapt a root-finding Newton-like method in order to minimize a differentiable
function $f : \mathbb{R}^n \to \mathbb{R}$. We simply try to locate a zero of the gradient of $f$. When $f$ is a
$C^2$ function, the function $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable, and so the root of $\nabla f$ leads to the
search direction

$$\mathbf{d}_t = -\mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t), \tag{B.19}$$

where $\mathbf{H}_t$ is the Hessian matrix at $\mathbf{x}_t$ (the matrix of Jacobi of the gradient is the Hessian).
When the learning rate $\alpha_t$ is equal to 1, the update $\mathbf{x}_t - \mathbf{H}_t^{-1} \nabla f(\mathbf{x}_t)$ can alternatively be
derived by assuming that $f(\mathbf{x})$ is approximately quadratic and convex in the neighborhood
of $\mathbf{x}_t$, that is,

$$f(\mathbf{x}) \approx f(\mathbf{x}_t) + (\mathbf{x} - \mathbf{x}_t)^\top \nabla f(\mathbf{x}_t) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_t)^\top \mathbf{H}_t\, (\mathbf{x} - \mathbf{x}_t), \tag{B.20}$$

and then minimizing the right-hand side of (B.20) with respect to $\mathbf{x}$.

The following algorithm uses an Armijo inexact line search for minimization and
guards against the possibility that the Hessian may not be positive definite (that is, its
Cholesky decomposition does not exist).


Algorithm B.3.2: Newton–Raphson for Minimizing $f(\mathbf{x})$
input: An initial guess $\mathbf{x}$; stopping error $\varepsilon > 0$; line search parameter $\xi \in (0, 1)$.
output: An approximate minimizer of $f(\mathbf{x})$.
$\mathbf{L} \leftarrow \mathbf{I}_n$ (the identity matrix)
while $\|\nabla f(\mathbf{x})\| > \varepsilon$ and budget is not exhausted do
    Compute the Hessian $\mathbf{H}$ at $\mathbf{x}$.
    if $\mathbf{H} \succ 0$ then // Cholesky is successful
        Update $\mathbf{L}$ to be the Cholesky factor satisfying $\mathbf{L}\mathbf{L}^\top = \mathbf{H}$.
    else
        Do not update the lower triangular $\mathbf{L}$.
    $\mathbf{d} \leftarrow -\mathbf{L}^{-1} \nabla f(\mathbf{x})$ (computed by forward substitution)
    $\mathbf{d} \leftarrow \mathbf{L}^{-\top} \mathbf{d}$ (computed by backward substitution)
    $\alpha \leftarrow 1$
    while $f(\mathbf{x} + \alpha \mathbf{d}) > f(\mathbf{x}) + \alpha\, 10^{-4}\, \nabla f(\mathbf{x})^\top \mathbf{d}$ do
        $\alpha \leftarrow \alpha \times \xi$
    $\mathbf{x} \leftarrow \mathbf{x} + \alpha\, \mathbf{d}$
return $\mathbf{x}$

A downside with all Newton-like methods is that at each step they require the calculation
and inversion of an $n \times n$ Hessian matrix, which has computing time of $O(n^3)$, and
is thus infeasible for large $n$. One way to avoid this cost is to use quasi-Newton methods,
described next.
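Before turning to quasi-Newton methods, Algorithm B.3.2 can be sketched in Python as follows. This is an illustrative version under stated assumptions: the name `newton_minimize`, the iteration cap `max_iter`, and the floor on $\alpha$ are additions, and the two triangular solves use the general-purpose `numpy.linalg.solve` for brevity rather than dedicated forward and backward substitution. The test function $f(\mathbf{x}) = \exp(x_1 + x_2) + x_1^2 + x_2^2$ is an arbitrary strictly convex example.

```python
import numpy as np

def newton_minimize(f, grad, hess, x0, eps=1e-8, xi=0.5, max_iter=100):
    """Newton-Raphson minimization with Armijo line search and Cholesky safeguard."""
    x = np.asarray(x0, dtype=float)
    L = np.eye(len(x))                             # L <- I_n
    for _ in range(max_iter):                      # max_iter plays the role of the budget
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        try:
            L = np.linalg.cholesky(hess(x))        # succeeds only if H is positive definite
        except np.linalg.LinAlgError:
            pass                                   # keep the previous factor L
        d = np.linalg.solve(L, -g)                 # d <- -L^{-1} grad f(x)
        d = np.linalg.solve(L.T, d)                # d <- L^{-T} d
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + alpha * 1e-4 * (g @ d) and alpha > 1e-12:
            alpha *= xi                            # shrink the step; floor is an added safeguard
        x = x + alpha * d
    return x

f = lambda x: np.exp(x[0] + x[1]) + x[0]**2 + x[1]**2
grad = lambda x: np.exp(x[0] + x[1]) + 2.0 * x
hess = lambda x: np.exp(x[0] + x[1]) * np.ones((2, 2)) + 2.0 * np.eye(2)
x_min = newton_minimize(f, grad, hess, x0=[1.0, 1.0])
```

By symmetry the minimizer has $x_1 = x_2 = u$ with $\exp(2u) + 2u = 0$, that is, $u \approx -0.28357$.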


B.3.2 Quasi-Newton Methods 


The idea behind quasi-Newton methods is to replace the inverse Hessian in (B.19) at
iteration $t$ by an $n \times n$ matrix $\mathbf{C}$ satisfying the secant condition:

$$\mathbf{C}\, \mathbf{g} = \boldsymbol{\delta}, \tag{B.21}$$

where $\boldsymbol{\delta} \leftarrow \mathbf{x}_t - \mathbf{x}_{t-1}$ and $\mathbf{g} \leftarrow \nabla f(\mathbf{x}_t) - \nabla f(\mathbf{x}_{t-1})$ are vectors stored in memory at each
iteration $t$. The secant condition is satisfied, for example, by Broyden’s family of matrices:

$$\mathbf{A} + \frac{1}{\mathbf{u}^\top \mathbf{g}} (\boldsymbol{\delta} - \mathbf{A}\mathbf{g})\, \mathbf{u}^\top$$

for some matrix $\mathbf{A}$ and vector $\mathbf{u}$ with $\mathbf{u}^\top \mathbf{g} \neq 0$. Since there is an infinite number of matrices that satisfy the
condition (B.21), we need a way to determine a unique $\mathbf{C}$ at each iteration $t$ such that computing
and storing $\mathbf{C}$ from one step to the next is fast and avoids any costly matrix inversion. The
following examples illustrate how, starting with an initial guess $\mathbf{C} = \mathbf{I}$ at $t = 0$, such a
matrix $\mathbf{C}$ can be efficiently updated from one iteration to the next.


Example B.8 (Low-Rank Hessian Update) The quadratic model (B.20) can be strengthened by further assuming that $\exp(-f(x))$ is proportional to a probability density that can be approximated in the neighborhood of $x_t$ by the pdf of the $\mathcal{N}(x_{t+1}, H_t^{-1})$ distribution. This normal approximation allows us to measure the discrepancy between two pairs $(x_0, H_0)$ and $(x_1, H_1)$ using the Kullback–Leibler divergence between the pdfs of the $\mathcal{N}(x_0, H_0^{-1})$ and $\mathcal{N}(x_1, H_1^{-1})$ distributions (see Exercise 4 on page 350):

$$\mathcal{D}(x_0, H_0^{-1} \mid x_1, H_1^{-1}) := \frac{1}{2}\left(\mathrm{tr}(H_1 H_0^{-1}) - \ln|H_1 H_0^{-1}| + (x_1 - x_0)^\top H_1 (x_1 - x_0) - n\right). \tag{B.22}$$


Suppose that the latest approximation to the inverse Hessian is $C$ and we wish to compute an updated approximation for step $t$. One approach is to find the symmetric matrix that minimizes its Kullback–Leibler discrepancy from $C$, as defined above, subject to the constraint (B.21). In other words,

$$\min_{A}\ \mathcal{D}(0, C \mid 0, A) \quad \text{subject to: } A g = \delta,\ A = A^\top.$$


The solution to this constrained optimization (see Exercise 10 on page 352) yields the Broyden–Fletcher–Goldfarb–Shanno or BFGS formula for updating the matrix $C$ from one iteration to the next:

$$C_{\mathrm{BFGS}} = C + \underbrace{\frac{g^\top\delta + g^\top C g}{(g^\top\delta)^2}\,\delta\delta^\top - \frac{1}{g^\top\delta}\left(\delta g^\top C + (\delta g^\top C)^\top\right)}_{\text{BFGS update}}. \tag{B.23}$$


In a practical implementation, we keep a single copy of C in memory and apply the BFGS 
update to it at every iteration. Note that if the current C is symmetric, then so is the updated 
matrix. Moreover, the BFGS update is a matrix of rank two. 

Since the Kullback–Leibler divergence is not symmetric, it is possible to flip the roles of $H_0$ and $H_1$ in (B.22) and instead solve

$$\min_{A}\ \mathcal{D}(0, A \mid 0, C) \quad \text{subject to: } A g = \delta,\ A = A^\top.$$

The solution (see Exercise 10 on page 352) gives the Davidon–Fletcher–Powell or DFP formula for updating the matrix $C$ from one iteration to the next:

$$C_{\mathrm{DFP}} = C + \underbrace{\frac{\delta\delta^\top}{g^\top\delta} - \frac{C g g^\top C}{g^\top C g}}_{\text{DFP update}}. \tag{B.24}$$

Note that if the curvature condition $g^\top\delta > 0$ holds and the current $C$ is symmetric positive definite, then so is its update.
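As a quick numerical illustration of the two low-rank updates, the following sketch applies the BFGS and DFP formulas once and checks the secant condition $Cg = \delta$. The function names `bfgs_update` and `dfp_update` and the particular vectors used are assumptions made for illustration only.

```python
import numpy as np

def bfgs_update(C, delta, g):
    """BFGS update of the inverse-Hessian approximation C."""
    gd = g @ delta                            # curvature g^T delta
    dgC = np.outer(delta, g @ C)              # the matrix delta g^T C
    return (C + (gd + g @ C @ g) / gd**2 * np.outer(delta, delta)
              - (dgC + dgC.T) / gd)

def dfp_update(C, delta, g):
    """DFP update of the inverse-Hessian approximation C."""
    return (C + np.outer(delta, delta) / (g @ delta)
              - np.outer(C @ g, g @ C) / (g @ C @ g))

# Assumed data: C = I (the initial guess) and a pair with g^T delta > 0.
C = np.eye(3)
delta = np.array([1.0, 0.5, -0.2])
g = np.array([2.0, 1.0, 0.3])

C_bfgs = bfgs_update(C, delta, g)
C_dfp = dfp_update(C, delta, g)
```

Both updated matrices satisfy the secant condition, remain symmetric, and, because $g^\top\delta > 0$ here, remain positive definite.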


Example B.9 (Diagonal Hessian Update) The original BFGS formula requires $O(n^2)$ storage and computation, which may be unmanageable for large $n$. One way to circumvent the prohibitive quadratic cost is to only store and update a diagonal Hessian matrix from one iteration to the next. If $C$ is diagonal, then we may not be able to satisfy the secant condition (B.21) and maintain positive definiteness. Instead, the secant condition (B.21) can be relaxed to the set of inequalities $g \geq C^{-1}\delta$, which are related to the definition of a subgradient for convex functions. We can then find a unique diagonal matrix by minimizing $\mathcal{D}(x_t, C \mid x_{t+1}, A)$ with respect to $A$, subject to the constraints that $A g \geq \delta$ and $A$ is diagonal. The solution (Exercise 15 on page 352) yields the updating formula for a diagonal element $c_i$ of $C$:

$$c_i \leftarrow \begin{cases} \dfrac{2c_i}{1 + \sqrt{1 + 4 c_i u_i^2}}, & \text{if } \dfrac{2c_i}{1 + \sqrt{1 + 4 c_i u_i^2}} \geq \delta_i/g_i, \\[2mm] \delta_i/g_i, & \text{otherwise,} \end{cases} \tag{B.25}$$

where $u := \nabla f(x_t)$ and we assume a unit learning rate: $x_{t+1} = x_t - A u$.


Example B.10 (Scalar Hessian Update) If the identity matrix is used in place of the Hessian in (B.19), one obtains steepest descent or gradient descent methods, in which the iteration (B.18) reduces to $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$.

The rationale for the name steepest descent is as follows. If we start from any point $x$ and make an infinitesimal move in some direction, then the function value is reduced by the largest magnitude in the (unit norm) direction $u^* := -\nabla f(x)/\|\nabla f(x)\|$. This is seen from the following inequality for all unit vectors $u$ (that is, $\|u\| = 1$):

$$\frac{d}{dt} f(x + t u^*)\Big|_{t=0} \leq \frac{d}{dt} f(x + t u)\Big|_{t=0}.$$

Observe that equality is achieved if and only if $u = u^*$. This inequality is an easy consequence of the Cauchy–Schwarz inequality:

$$-\nabla f^\top u \leq |\nabla f^\top u| \underbrace{\leq}_{\text{Cauchy–Schwarz}} \|u\|\,\|\nabla f\| = \|\nabla f\| = -\nabla f^\top u^*.$$


The steepest descent iteration, $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$, still requires a suitable choice of the learning rate $\alpha_t$. An alternative way to think about the iteration is to assume that the learning rate is always unity, and that at each iteration we use an inverse Hessian matrix of the form $\alpha_t I$ for some positive constant $\alpha_t$. Satisfying the secant condition (B.21) with a matrix of the form $C = \alpha I$ is in general not possible. However, it is possible to choose $\alpha$ so that the secant condition (B.21) is satisfied in the direction of $g$ (or alternatively $\delta$). This gives the Barzilai–Borwein formulas for the learning rate at iteration $t$:

$$\alpha_t = \frac{g^\top\delta}{\|g\|^2} \quad \left(\text{or alternatively } \alpha_t = \frac{\|\delta\|^2}{\delta^\top g}\right). \tag{B.26}$$
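The first Barzilai–Borwein formula can be tried out on a small problem with the sketch below, which uses only the Python standard library. The function name `barzilai_borwein`, the initial learning rate `alpha0`, and the quadratic test function $f(x) = \frac{1}{2}(x_1^2 + 10x_2^2)$ are assumptions chosen for illustration.

```python
def grad(x):
    """Gradient of the assumed test function f(x) = 0.5*(x1^2 + 10*x2^2)."""
    return [x[0], 10.0 * x[1]]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def barzilai_borwein(grad, x, alpha0=0.01, tol=1e-10, max_iter=500):
    """Gradient descent with learning rate alpha_t = g^T delta / ||g||^2."""
    u = grad(x)
    alpha = alpha0                            # no history yet: use a small default
    for _ in range(max_iter):
        if dot(u, u) ** 0.5 <= tol:
            break
        x_new = [xi - alpha * ui for xi, ui in zip(x, u)]
        u_new = grad(x_new)
        delta = [a - b for a, b in zip(x_new, x)]   # delta = x_t - x_{t-1}
        g = [a - b for a, b in zip(u_new, u)]       # g = difference of gradients
        if dot(g, g) > 0:
            alpha = dot(g, delta) / dot(g, g)       # secant condition along g
        x, u = x_new, u_new
    return x

x_min = barzilai_borwein(grad, [1.0, 1.0])
```

For this quadratic the learning rate settles between the reciprocals of the two curvatures (here between 0.1 and 1), without any line search.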


B.3.3 Normal Approximation Method 


Let $\phi_{H_t^{-1}}(x - x_{t+1})$ denote the pdf of the $\mathcal{N}(x_{t+1}, H_t^{-1})$ distribution. As we already saw in Example B.8, the quadratic approximation (B.20) of $f$ in the neighborhood of $x_t$ is equivalent (up to a constant) to the minus of the logarithm of the pdf $\phi_{H_t^{-1}}(x - x_{t+1})$. In other words, we use $\phi_{H_t^{-1}}(x - x_{t+1})$ as a simple model for the density

$$\frac{\exp(-f(x))}{\int \exp(-f(y))\,dy}.$$

One consequence of the normal approximation is that for $x$ in the neighborhood of $x_{t+1}$, we can write:

$$-\nabla f(x) \approx \frac{\partial}{\partial x}\ln \phi_{H_t^{-1}}(x - x_{t+1}) = -H_t(x - x_{t+1}).$$

In other words, using the fact that $H_t^\top = H_t$,

$$\nabla f(x)[\nabla f(x)]^\top \approx H_t (x - x_{t+1})(x - x_{t+1})^\top H_t,$$

and taking expectations on both sides with respect to $X \sim \mathcal{N}(x_{t+1}, H_t^{-1})$ gives:

$$\mathbb{E}\,\nabla f(X)[\nabla f(X)]^\top \approx H_t.$$


This suggests that, given the gradient vectors computed in the past $h$ (where $h$ stands for history) Newton iterations:

$$u_i := \nabla f(x_i), \quad i = t-(h-1), \ldots, t,$$

the Hessian matrix $H_t$ can be approximated via the average

$$\frac{1}{h}\sum_{i=t-h+1}^{t} u_i u_i^\top.$$

A shortcoming of this approximation is that, unless $h$ is large enough, the Hessian approximation $\sum_{i=t-h+1}^{t} u_i u_i^\top$ may not be full rank and hence not invertible. To ensure that the Hessian approximation is invertible, we add a suitable diagonal matrix $A_0$ to obtain the regularized version of the approximation:

$$H_t \approx A_0 + \frac{1}{h}\sum_{i=t-h+1}^{t} u_i u_i^\top.$$

With this full-rank approximation of the Hessian, the Newton search direction in (B.19) becomes:

$$d_t = -\left(A_0 + \frac{1}{h}\sum_{i=t-h+1}^{t} u_i u_i^\top\right)^{-1} u_t. \tag{B.27}$$

Thus, $d_t$ can be computed in $O(h^2 n)$ time via the Sherman–Morrison Algorithm A.6.1. Further to this, the search direction (B.27) can be efficiently updated to the next one:

$$d_{t+1} = -\left(A_0 + \frac{1}{h}\sum_{i=t-h+2}^{t+1} u_i u_i^\top\right)^{-1} u_{t+1}$$

in $O(hn)$ time, thus avoiding the usual $O(h^2 n)$ cost (see Exercise 6 on page 351).


B.3.4 Nonlinear Least Squares 


Consider the squared-error training loss in nonlinear regression:

$$\ell_\tau(g(\cdot \mid \beta)) = \frac{1}{n}\sum_{i=1}^{n}\left(g(x_i \mid \beta) - y_i\right)^2,$$

where $g(\cdot \mid \beta)$ is a nonlinear prediction function that depends on the parameter $\beta$ (for example, (5.29) shows the nonlinear logistic prediction function). The training loss can be written as $\frac{1}{n}\|g(\tau \mid \beta) - y\|^2$, where $g(\tau \mid \beta) := [g(x_1 \mid \beta), \ldots, g(x_n \mid \beta)]^\top$ is the vector of outputs.

We wish to minimize the training loss in terms of $\beta$. In the Newton-like methods in Section B.3.1, one derives an iterative minimization algorithm that is inspired by a Taylor expansion of $\ell_\tau(g(\cdot \mid \beta))$. Instead, given a current guess $\beta_t$, we can consider the Taylor expansion of the nonlinear prediction function $g$:

$$g(\tau \mid \beta) \approx g(\tau \mid \beta_t) + G_t(\beta - \beta_t),$$

where $G_t := J_g(\beta_t)$ is the matrix of Jacobi of $g(\tau \mid \beta)$ at $\beta_t$. Denoting the residual $e_t := g(\tau \mid \beta_t) - y$ and replacing $g(\tau \mid \beta)$ with its Taylor approximation in $\ell_\tau(g(\cdot \mid \beta))$, we obtain the approximation to the training loss in the neighborhood of $\beta_t$:

$$\ell_\tau(g(\cdot \mid \beta)) \approx \frac{1}{n}\|G_t(\beta - \beta_t) + e_t\|^2.$$

The minimization of the right-hand side is a linear least-squares problem and therefore $d_t := \beta - \beta_t$ satisfies the normal equations: $G_t^\top G_t d_t = G_t^\top(-e_t)$. Assuming that $G_t^\top G_t$ is invertible, the normal equations yield the Gauss–Newton search direction:

$$d_t = -(G_t^\top G_t)^{-1} G_t^\top e_t.$$
Unlike the search direction (B.19) for Newton-like algorithms, the search direction of a Gauss–Newton algorithm does not require the computation of a Hessian matrix.

Observe that in the Gauss–Newton approach we determine $d_t$ by viewing the search direction as coefficients in a linear regression with feature matrix $G_t$ and response $-e_t$. This suggests that instead of using a linear regression, we can compute $d_t$ via a ridge regression with a suitable choice for the regularization parameter $\gamma$:

$$d_t = -(G_t^\top G_t + n\gamma I_p)^{-1} G_t^\top e_t.$$

If we replace $n I_p$ with the diagonal matrix $\mathrm{diag}(G_t^\top G_t)$, we then obtain the Levenberg–Marquardt search direction:

$$d_t = -(G_t^\top G_t + \gamma\,\mathrm{diag}(G_t^\top G_t))^{-1} G_t^\top e_t. \tag{B.28}$$

Recall that the ridge regularization parameter $\gamma$ has the following effect on the least-squares solution: When it is zero, then the solution $d_t$ coincides with the search direction of the Gauss–Newton method, and when $\gamma$ tends to infinity, then $\|d_t\|$ tends to zero. Thus, $\gamma$ controls both the magnitude and direction of vector $d_t$. A simple version of the Levenberg–Marquardt algorithm is the following.


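Algorithm B.3.3: Levenberg–Marquardt for Minimizing $\|g(\tau \mid \beta) - y\|^2$
input: An initial guess $\beta_0$; stopping error $\varepsilon > 0$; training set $\tau$.
output: An approximate minimizer of $\|g(\tau \mid \beta) - y\|^2$.

1 $t \leftarrow 0$ and $\gamma \leftarrow 0.01$ (or another default value)
2 while stopping condition is not met do
3     Compute the search direction $d_t$ via (B.28).
4     $e_{t+1} \leftarrow g(\tau \mid \beta_t + d_t) - y$
5     if $\|e_{t+1}\| < \|e_t\|$ then
6         $\gamma \leftarrow \gamma/10$, $e_{t+1} \leftarrow e_{t+1}$, $\beta_{t+1} \leftarrow \beta_t + d_t$
7     else
8         $\gamma \leftarrow \gamma \times 10$, $e_{t+1} \leftarrow e_t$, $\beta_{t+1} \leftarrow \beta_t$
9     $t \leftarrow t + 1$
10 return $\beta_t$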
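The loop above can be sketched in Python as follows. The function name `levenberg_marquardt`, the stopping rule based on the step length $\|d_t\|$, and the exponential model $g(x \mid \beta) = \beta_1 e^{\beta_2 x}$ with noise-free data are assumptions made purely for illustration; the algorithm itself leaves the stopping condition unspecified.

```python
import numpy as np

def levenberg_marquardt(model, jac, beta, y, gamma=0.01, tol=1e-10, max_iter=200):
    """Levenberg-Marquardt iteration with search direction (B.28) (sketch)."""
    beta = np.asarray(beta, dtype=float)
    e = model(beta) - y                       # residual e_t
    for _ in range(max_iter):
        G = jac(beta)                         # matrix of Jacobi G_t
        GtG = G.T @ G
        d = -np.linalg.solve(GtG + gamma * np.diag(np.diag(GtG)), G.T @ e)
        e_new = model(beta + d) - y
        if np.linalg.norm(e_new) < np.linalg.norm(e):
            gamma /= 10                       # accept: move towards Gauss-Newton
            beta, e = beta + d, e_new
        else:
            gamma *= 10                       # reject: shorten the step, keep beta
        if np.linalg.norm(d) <= tol:          # assumed stopping condition
            break
    return beta

# Assumed example: fit g(x | beta) = beta_1 * exp(beta_2 * x) to noise-free data.
x = np.linspace(0.0, 1.0, 20)
beta_true = np.array([2.0, -1.5])
y = beta_true[0] * np.exp(beta_true[1] * x)

def model(beta):
    return beta[0] * np.exp(beta[1] * x)

def jac(beta):
    ex = np.exp(beta[1] * x)
    return np.column_stack([ex, beta[0] * x * ex])

beta_hat = levenberg_marquardt(model, jac, [1.0, 0.0], y)
```

When a step is rejected the parameter vector and residual are left unchanged, so only $\gamma$ changes before the search direction is recomputed.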


B.4 Constrained Minimization via Penalty Functions 


A constrained optimization problem of the form (B.13) can sometimes be reformulated as a simpler unconstrained problem. For example, the unconstrained set $\mathcal{Y}$ can be transformed to the feasible region $\mathcal{X}$ of the constrained problem via a function $\phi : \mathbb{R}^n \to \mathbb{R}^n$ such that $\mathcal{X} = \phi(\mathcal{Y})$. Then, (B.13) is equivalent to the minimization problem

$$\min_{y \in \mathcal{Y}} f(\phi(y)),$$

in the sense that a solution $x^*$ of the original problem is obtained from a transformed solution $y^*$ via $x^* = \phi(y^*)$. Table B.2 lists some examples of possible transformations.
Table B.2: Some transformations to eliminate constraints.

Constrained            Unconstrained
$x > 0$                $\exp(y)$
$x \geq 0$             $y^2$
$a \leq x \leq b$      $a + (b - a)\sin^2(y)$





Unfortunately, an unconstrained minimization method used in combination with these 
transformations is rarely effective. Instead, it is more common to use penalty functions. 

The overarching idea of penalty functions is to transform a constrained problem into 
an unconstrained problem by adding weighted constraint-violation terms to the original 
objective function, with the premise that the new problem has a solution that is identical or 
close to the original one. 

For example, if there are only equality constraints, then

$$\tilde f(x) := f(x) + \sum_{i=1}^{m} a_i\,|h_i(x)|^p$$

for some constants $a_1, \ldots, a_m > 0$ and integer $p \in \{1, 2\}$, gives an exact penalty function, in the sense that the minimizer of the penalized function $\tilde f$ is equal to the minimizer of $f$ subject to the $m$ equality constraints $h_1, \ldots, h_m$. With the addition of inequality constraints, one could use

$$\tilde f(x) = f(x) + \sum_{i=1}^{m} a_i\,|h_i(x)|^p + \sum_{j=1}^{k} b_j \max\{g_j(x), 0\}$$

for some constants $a_1, \ldots, a_m, b_1, \ldots, b_k > 0$.


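Example B.11 (Alternating Direction Method of Multipliers) The Lagrange method is designed to handle convex minimization subject to equality constraints. Nevertheless, some practical algorithms may still use the penalty function approach in combination with the Lagrangian method. An example is the alternating direction method of multipliers (ADMM) [17]. The ADMM solves problems of the form:

$$\min_{x \in \mathbb{R}^n,\ z \in \mathbb{R}^m} f(x) + g(z) \quad \text{subject to: } Ax + Bz = c, \tag{B.29}$$

where $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$, and $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^m \to \mathbb{R}$ are convex functions. The approach is to form an augmented Lagrangian

$$\mathcal{L}_\varrho(x, z, \beta) = f(x) + g(z) + \beta^\top(Ax + Bz - c) + \frac{\varrho}{2}\|Ax + Bz - c\|^2,$$

where $\varrho > 0$ is a penalty parameter, and $\beta \in \mathbb{R}^p$ are dual variables. The ADMM then iterates through updates of the following form:

$$x^{(t+1)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \mathcal{L}_\varrho(x, z^{(t)}, \beta^{(t)}),$$

$$z^{(t+1)} = \operatorname*{argmin}_{z \in \mathbb{R}^m} \mathcal{L}_\varrho(x^{(t+1)}, z, \beta^{(t)}),$$

$$\beta^{(t+1)} = \beta^{(t)} + \varrho\left(Ax^{(t+1)} + Bz^{(t+1)} - c\right).$$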
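To see the three ADMM updates in action, consider the assumed toy problem of minimizing $\frac{1}{2}\|x - a\|^2 + \lambda\|z\|_1$ subject to $x - z = 0$, which corresponds to $A = I$, $B = -I$, and $c = 0$. For this choice both minimizations have closed forms: the $x$-update is a linear equation and the $z$-update is a soft-thresholding step. The names `admm_l1` and `soft_threshold` and the data below are illustrative assumptions.

```python
def soft_threshold(v, kappa):
    """Minimizer of kappa*|z| + 0.5*(z - v)^2: shrink v towards zero by kappa."""
    return max(v - kappa, 0.0) - max(-v - kappa, 0.0)

def admm_l1(a, lam, rho=1.0, n_iter=100):
    """ADMM for min 0.5*||x - a||^2 + lam*||z||_1 subject to x - z = 0."""
    n = len(a)
    x, z, beta = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(n_iter):
        # x-update: solve (x - a) + beta + rho*(x - z) = 0 for x
        x = [(a[i] - beta[i] + rho * z[i]) / (1.0 + rho) for i in range(n)]
        # z-update: soft-thresholding of x + beta/rho
        z = [soft_threshold(x[i] + beta[i] / rho, lam / rho) for i in range(n)]
        # dual update: beta <- beta + rho*(A x + B z - c) with A = I, B = -I, c = 0
        beta = [beta[i] + rho * (x[i] - z[i]) for i in range(n)]
    return x, z, beta

a = [3.0, 0.5, -2.0]
x_hat, z_hat, beta_hat = admm_l1(a, lam=1.0)
```

The known solution of this toy problem is the soft-thresholded vector, here $(2, 0, -1)$, which gives a direct check on the iterates.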
Suppose that (B.13) has inequality constraints only. Barrier functions are an important example of penalty functions that can handle inequality constraints. The prototypical example is a logarithmic barrier function which gives the unconstrained optimization:

$$\tilde f(x) = f(x) - \nu\sum_{j=1}^{k}\ln(-g_j(x)), \quad \nu > 0,$$

such that the minimizer of $\tilde f$ tends to the minimizer of $f$ as $\nu \to 0$. Direct minimization of $\tilde f$ via an unconstrained minimization algorithm is frequently too difficult. Instead, it is common to combine the logarithmic barrier function with the Lagrangian method as follows.

The idea is to introduce $k$ nonnegative auxiliary or slack variables $s_1, \ldots, s_k$ that satisfy the equalities $g_j(x) + s_j = 0$ for all $j$. These equalities ensure that the inequality constraints are maintained: $g_j(x) = -s_j \leq 0$ for all $j$. Then, instead of the unconstrained optimization of $\tilde f$, we consider the unconstrained optimization of the Lagrangian:

$$\mathcal{L}(x, s, \beta) = f(x) - \nu\sum_{j=1}^{k}\ln s_j + \sum_{j=1}^{k}\beta_j\,(g_j(x) + s_j), \tag{B.30}$$

where $\nu > 0$ and $\beta$ are the Lagrange multipliers for the equalities $g_j(x) + s_j = 0$, $j = 1, \ldots, k$.

Observe how the logarithmic barrier function keeps the slack variables positive. In addition, while the optimization of $\tilde f$ is over $n$ dimensions (recall that $x \in \mathbb{R}^n$), the optimization of the Lagrangian function $\mathcal{L}$ is over $n + 2k$ dimensions. Despite this enlargement of the search space with the variables $s$ and $\beta$, the optimization of the Lagrangian $\mathcal{L}$ is easier in practice than the direct optimization of $\tilde f$.


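Example B.12 (Interior-Point Method for Nonnegativity) One of the simplest and most common constrained optimization problems can be formulated as the minimization of $f(x)$ subject to nonnegative $x$, that is: $\min_{x \geq 0} f(x)$. In this case, the Lagrangian with logarithmic barrier (B.30) is:

$$\mathcal{L}(x, s, \beta) = f(x) - \nu\sum_{k}\ln s_k + \beta^\top(s - x).$$

The KKT conditions in Theorem B.2 are a necessary condition for a minimizer, and yield the nonlinear system for $[x^\top, s^\top, \beta^\top]^\top \in \mathbb{R}^{3n}$:

$$\begin{bmatrix} \nabla f(x) - \beta \\ -\nu/s + \beta \\ s - x \end{bmatrix} = 0,$$

where $\nu/s$ is a shorthand notation for a column vector with components $\{\nu/s_j\}$. To solve this system, we can use Newton's method for root finding (see, for example, Algorithm B.3.1), which requires a formula for the matrix of Jacobi of $\mathcal{L}$. Here, this $(3n) \times (3n)$ matrix is:

$$J_{\mathcal{L}}(x, s, \beta) = \begin{bmatrix} H & O & -I \\ O & D & I \\ -I & I & O \end{bmatrix} = \begin{bmatrix} H & B \\ B^\top & E \end{bmatrix},$$

where $H$ is the $n \times n$ Hessian of $f$ at $x$; $D := \mathrm{diag}(\nu/(s \odot s))$ is an $n \times n$ diagonal matrix; $B := [O, -I]$ is an $n \times (2n)$ matrix, and¹

$$E := \begin{bmatrix} D & I \\ I & O \end{bmatrix} = \begin{bmatrix} O & I \\ I & -D \end{bmatrix}^{-1}.$$

Further, we define

$$H_\nu := (H - B E^{-1} B^\top)^{-1} = (H + D)^{-1}.$$

Using this notation and applying the matrix blockwise inversion formula (A.14), we obtain the inverse of the matrix of Jacobi:

$$\begin{bmatrix} H & B \\ B^\top & E \end{bmatrix}^{-1} = \begin{bmatrix} H_\nu & -H_\nu B E^{-1} \\ -E^{-1}B^\top H_\nu & E^{-1} + E^{-1}B^\top H_\nu B E^{-1} \end{bmatrix} = \begin{bmatrix} H_\nu & H_\nu & -H_\nu D \\ H_\nu & H_\nu & I - H_\nu D \\ -D H_\nu & I - D H_\nu & D H_\nu D - D \end{bmatrix}.$$

Therefore, the search direction in Newton's root-finding method is given by:

$$-J_{\mathcal{L}}^{-1}\begin{bmatrix} \nabla f(x) - \beta \\ -\nu/s + \beta \\ s - x \end{bmatrix} = \begin{bmatrix} dx \\ dx + x - s \\ \nu/s - \beta - D(dx + x - s) \end{bmatrix},$$

where

$$dx := -(H + D)^{-1}\left[\nabla f(x) - 2\nu/s + D x\right],$$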

and we have assumed that H + D is a positive-definite matrix. If at any step of the iteration 
the matrix H + D fails to be positive-definite, then Newton’s root-finding algorithm may 
fail to converge. Thus, any practical implementation will have to include a fail-safe feature 
to guard against this possibility. 

In summary, for a given penalty parameter $\nu > 0$, we can locate the approximate nonnegative minimizer $x_\nu$ of $f$ using, for example, the version of the Newton–Raphson root-finding method given in Algorithm B.4.1.

In practice, one needs to choose a sufficiently small value for $\nu$, so that the output $x_\nu$ of Algorithm B.4.1 is a good approximation to $x^* = \operatorname{argmin}_{x \geq 0} f(x)$. Alternatively, one can create a decreasing sequence of penalty parameters $\nu_1 > \nu_2 > \cdots$ and compute the corresponding solutions $x_{\nu_1}, x_{\nu_2}, \ldots$ of the penalized problems. In the so-called interior-point method, a given $x_{\nu_i}$ is used as an initial guess for computing $x_{\nu_{i+1}}$, and so on until the approximation to the minimizer $x^* = \operatorname{argmin}_{x \geq 0} f(x)$ is deemed accurate.

¹Here $O$ is an $n \times n$ matrix of zeros and $I$ is the $n \times n$ identity matrix.
Algorithm B.4.1: Approximating $x^* = \operatorname{argmin}_{x \geq 0} f(x)$ with Logarithmic Barrier
input: An initial guess $x$ and stopping error $\varepsilon > 0$.
output: The approximate nonnegative minimizer $x_\nu$ of $f$.

1 $s \leftarrow x$, $\beta \leftarrow \nu/s$, $dx \leftarrow \beta$
2 while $\|dx\| > \varepsilon$ and budget is not exhausted do
3     Compute the gradient $u$ and the Hessian $H$ of $f$ at $x$.
4     $s_1 \leftarrow \nu/s$, $s_2 \leftarrow s_1/s$, $w \leftarrow 2s_1 - u - s_2 \odot x$
5     if $(H + \mathrm{diag}(s_2)) \succ 0$ then    // if Cholesky successful
6         Compute the Cholesky factor $L$ satisfying $LL^\top = H + \mathrm{diag}(s_2)$.
7         $dx \leftarrow L^{-1} w$ (computed by forward substitution)
8         $dx \leftarrow L^{-\top} dx$ (computed by backward substitution)
9     else
10        $dx \leftarrow w/s_2$    // if Cholesky fails, do steepest descent
11    $ds \leftarrow dx + x - s$, $d\beta \leftarrow s_1 - \beta - s_2 \odot ds$, $\alpha \leftarrow 1$
12    while $\min_j\{s_j + \alpha\,ds_j\} < 0$ do
13        $\alpha \leftarrow \alpha/2$    // ensure nonnegative slack variables
14    $x \leftarrow x + \alpha\,dx$, $s \leftarrow s + \alpha\,ds$, $\beta \leftarrow \beta + \alpha\,d\beta$
15 return $x_\nu \leftarrow x$
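The algorithm can be sketched in Python as follows. The function name `barrier_newton`, the default values of `nu`, `eps`, and `max_iter`, and the quadratic test function $f(x) = \frac{1}{2}\|x - c\|^2$ are illustrative assumptions. As a small deviation from the pseudocode, the step-halving test uses `<=` rather than `<`, so that the slack variables stay strictly positive and the divisions by $s$ are always defined.

```python
import numpy as np

def barrier_newton(grad, hess, x, nu=1e-6, eps=1e-10, max_iter=200):
    """Newton root-finding for the log-barrier KKT system (illustrative sketch)."""
    x = np.asarray(x, dtype=float)            # initial guess must be strictly positive
    s = x.copy()
    beta = nu / s
    dx = beta.copy()
    it = 0
    while np.linalg.norm(dx) > eps and it < max_iter:
        u, H = grad(x), hess(x)
        s1 = nu / s
        s2 = s1 / s
        w = 2 * s1 - u - s2 * x
        try:
            L = np.linalg.cholesky(H + np.diag(s2))
            dx = np.linalg.solve(L.T, np.linalg.solve(L, w))
        except np.linalg.LinAlgError:
            dx = w / s2                       # fallback if H + diag(s2) is not positive definite
        ds = dx + x - s
        dbeta = s1 - beta - s2 * ds
        alpha = 1.0
        while np.min(s + alpha * ds) <= 0:    # '<=' keeps the slack strictly positive
            alpha /= 2
        x, s, beta = x + alpha * dx, s + alpha * ds, beta + alpha * dbeta
        it += 1
    return x

# Assumed example: f(x) = 0.5*||x - c||^2, whose nonnegative minimizer is (1, 0).
c = np.array([1.0, -1.0])
x_nu = barrier_newton(lambda x: x - c, lambda x: np.eye(2), [2.0, 2.0])
```

For this example the output is strictly positive and lies close to the constrained minimizer $(1, 0)$; how close depends on the penalty parameter `nu` and on the stopping error.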


Further Reading 


For an excellent introduction to convex optimization and Lagrangian duality see [18]. A 
classical text on optimization algorithms and, in particular, on quasi-Newton methods is 
[43]. For more details on the alternating direction method of multipliers see [17]. 
APPENDIX C





PROBABILITY AND STATISTICS 





The purpose of this chapter is to establish the baseline probability and statistics 
background for this book. We review basic concepts such as the sum and product rules 
of probability, random variables and their probability distributions, expectations, in- 
dependence, conditional probability, transformation rules, limit theorems, and Markov 
chains. The properties of the multivariate normal distribution are discussed in more de- 
tail. The main ideas from statistics are also reviewed, including estimation techniques 
(such as maximum likelihood estimation), confidence intervals, and hypothesis testing. 


C.1 Random Experiments and Probability Spaces 


The basic notion in probability theory is that of a random experiment: an experiment whose outcome cannot be determined in advance. Mathematically, a random experiment is modeled via a triplet $(\Omega, \mathcal{H}, \mathbb{P})$, where:

• $\Omega$ is the set of all possible outcomes of the experiment, called the sample space.

• $\mathcal{H}$ is the collection of all subsets of $\Omega$ to which a probability can be assigned; such subsets are called events.

• $\mathbb{P}$ is a probability measure, which assigns to each event $A$ a number $\mathbb{P}[A]$ between 0 and 1, indicating the likelihood that the outcome of the random experiment lies in $A$.

Any probability measure $\mathbb{P}$ must satisfy the following Kolmogorov axioms:

1. $\mathbb{P}[A] \geq 0$ for every event $A$.

2. $\mathbb{P}[\Omega] = 1$.

3. For any sequence $A_1, A_2, \ldots$ of events,

$$\mathbb{P}\Big[\bigcup_i A_i\Big] \leq \sum_i \mathbb{P}[A_i], \tag{C.1}$$

with strict equality whenever the events are disjoint (that is, non-overlapping).
When (C.1) holds as an equality, it is often referred to as the sum rule of probability. It simply states that if an event can happen in a number of different but not simultaneous ways, the probability of that event is the sum of the probabilities of the comprising events. If the events are allowed to overlap, then the inequality (C.1) is called the union bound.

In many applications the sample space is countable; that is, $\Omega = \{a_1, a_2, \ldots\}$. In this case the easiest way to specify a probability measure $\mathbb{P}$ is to first assign a number $p_i$ to each elementary event $\{a_i\}$, with $\sum_i p_i = 1$, and then to define

$$\mathbb{P}[A] = \sum_{i\,:\,a_i \in A} p_i \quad \text{for all } A \subseteq \Omega.$$

Here the collection of events $\mathcal{H}$ can be taken to be equal to the collection of all subsets of $\Omega$. The triple $(\Omega, \mathcal{H}, \mathbb{P})$ is called a discrete probability space. This idea is graphically represented in Figure C.1. Each element $a_i$, represented by a dot, is assigned a weight (that is, probability) $p_i$, indicated by the size of the dot. The probability of the event $A$ is simply the sum of the weights of all the outcomes in $A$.





Figure C.1: A discrete probability space. 


Remark C.1 (Equilikely Principle) A special case of a discrete probability space occurs when a random experiment has finitely many outcomes that are all equally likely. In this case the probability measure is given by

$$\mathbb{P}[A] = \frac{|A|}{|\Omega|}, \tag{C.2}$$

where $|A|$ denotes the number of outcomes in $A$ and $|\Omega|$ is the total number of outcomes. Thus, the calculation of probabilities reduces to counting numbers of outcomes in events. This is called the equilikely principle.
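As a small illustration of the equilikely principle, the sketch below counts outcomes for an assumed experiment in which two fair dice are thrown. The helper name `prob` and the two events are hypothetical choices; exact fractions are used so that the counting argument is visible in the output.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of throwing two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P[A] = |A| / |Omega| for an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def sum_is_seven(w):
    return w[0] + w[1] == 7

def first_is_six(w):
    return w[0] == 6

p_a = prob(sum_is_seven)                                   # 6 outcomes out of 36
p_b = prob(first_is_six)                                   # 6 outcomes out of 36
p_union = prob(lambda w: sum_is_seven(w) or first_is_six(w))
```

The two events overlap in the single outcome $(6, 1)$, so the union has 11 outcomes and its probability is strictly smaller than the sum of the two probabilities, in agreement with the union bound.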


C.2 Random Variables and Probability Distributions 


It is often convenient to describe a random experiment via "random variables", representing numerical measurements of the experiment. Random variables are usually denoted by capital letters from the last part of the alphabet. From a mathematical point of view, a random variable $X$ is a function from $\Omega$ to $\mathbb{R}$ such that sets of the form $\{a < X \leq b\} := \{\omega \in \Omega : a < X(\omega) \leq b\}$ are events (and so can be assigned a probability).

All probabilities involving a random variable $X$ can be computed, in principle, from its cumulative distribution function (cdf), defined by

$$F(x) = \mathbb{P}[X \leq x], \quad x \in \mathbb{R}.$$

For example, $\mathbb{P}[a < X \leq b] = \mathbb{P}[X \leq b] - \mathbb{P}[X \leq a] = F(b) - F(a)$. Figure C.2 shows a generic cdf. Note that any cdf is right-continuous, increasing, and lies between 0 and 1.

Figure C.2: A cumulative distribution function (cdf).


A cdf $F_d$ is called discrete if there exist numbers $x_1, x_2, \ldots$ and probabilities $0 < f(x_i) \leq 1$ summing up to 1, such that for all $x$

$$F_d(x) = \sum_{x_i \leq x} f(x_i). \tag{C.3}$$

Such a cdf is piecewise constant and has jumps of sizes $f(x_1), f(x_2), \ldots$ at points $x_1, x_2, \ldots$, respectively. The function $f(x)$ is called a probability mass function or discrete probability density function (pdf). It is often easier to use the pdf rather than the cdf, since probabilities can simply be calculated from it via summation:

$$\mathbb{P}[X \in B] = \sum_{x \in B} f(x),$$

as illustrated in Figure C.3.

Figure C.3: Discrete probability density function (pdf). The darker area corresponds to the probability $\mathbb{P}[X \in B]$.
A cdf $F_c$ is called continuous¹, if there exists a positive function $f$ such that for all $x$

$$F_c(x) = \int_{-\infty}^{x} f(u)\,du. \tag{C.4}$$

Note that such an $F_c$ is differentiable (and hence continuous) with derivative $f$. The function $f$ is called the probability density function (continuous pdf). By the fundamental theorem of integration, we have

$$\mathbb{P}[a < X \leq b] = F(b) - F(a) = \int_a^b f(x)\,dx.$$

Thus, calculating probabilities reduces to integration, as illustrated in Figure C.4.

Figure C.4: Continuous probability density function (pdf). The shaded area corresponds to the probability $\mathbb{P}[X \in B]$, with $B$ being here the interval $(a, b]$.


Remark C.2 (Probability Density and Probability Mass) It is important to note that we deliberately use the same name, "pdf", and symbol, $f$, in both the discrete and the continuous case, rather than distinguish between a probability mass function (pmf) and probability density function (pdf). From a theoretical point of view the pdf plays exactly the same role in the discrete and continuous cases. We use the notation $X \sim \mathsf{Dist}$, $X \sim f$, and $X \sim F$ to indicate that $X$ has distribution $\mathsf{Dist}$, pdf $f$, and cdf $F$.

Tables C.1 and C.2 list a number of important continuous and discrete distributions. Note that in Table C.1, $\Gamma$ is the gamma function: $\Gamma(\alpha) = \int_0^\infty e^{-x} x^{\alpha-1}\,dx$, $\alpha > 0$.

¹In advanced probability, we would say "absolutely continuous with respect to the Lebesgue measure".
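Table C.1: Commonly used continuous distributions.

Name            Notation                              $f(x)$                                                                                                                  $x \in$              Parameters
Uniform         $\mathsf{U}[\alpha, \beta]$           $\frac{1}{\beta - \alpha}$                                                                                              $[\alpha, \beta]$    $\alpha < \beta$
Normal          $\mathcal{N}(\mu, \sigma^2)$          $\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$                                      $\mathbb{R}$         $\sigma > 0,\ \mu \in \mathbb{R}$
Gamma           $\mathsf{Gamma}(\alpha, \lambda)$     $\frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$                                                     $\mathbb{R}_+$       $\alpha, \lambda > 0$
Inverse Gamma   $\mathsf{InvGamma}(\alpha, \lambda)$  $\frac{\lambda^\alpha x^{-\alpha-1} e^{-\lambda x^{-1}}}{\Gamma(\alpha)}$                                               $\mathbb{R}_+$       $\alpha, \lambda > 0$
Exponential     $\mathsf{Exp}(\lambda)$               $\lambda e^{-\lambda x}$                                                                                                $\mathbb{R}_+$       $\lambda > 0$
Beta            $\mathsf{Beta}(\alpha, \beta)$        $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$                                 $[0, 1]$             $\alpha, \beta > 0$
Weibull         $\mathsf{Weib}(\alpha, \lambda)$      $\alpha\lambda\,(\lambda x)^{\alpha-1} e^{-(\lambda x)^\alpha}$                                                         $\mathbb{R}_+$       $\alpha, \lambda > 0$
Pareto          $\mathsf{Pareto}(\alpha, \lambda)$    $\alpha\lambda\,(1 + \lambda x)^{-(\alpha+1)}$                                                                          $\mathbb{R}_+$       $\alpha, \lambda > 0$
Student         $\mathsf{t}_\nu$                      $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$   $\mathbb{R}$    $\nu > 0$
F               $\mathsf{F}(m, n)$                    $\frac{\Gamma\left(\frac{m+n}{2}\right)(m/n)^{m/2}\,x^{(m-2)/2}}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)\,[1 + (m/n)x]^{(m+n)/2}}$   $\mathbb{R}_+$   $m, n \in \mathbb{N}_+$

The $\mathsf{Gamma}(n/2, 1/2)$ distribution is called the chi-squared distribution with $n$ degrees of freedom, denoted $\chi^2_n$. The $\mathsf{t}_1$ distribution is also called the Cauchy distribution.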


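Table C.2: Commonly used discrete distributions.

Name               Notation                      $f(x)$                                    $x \in$                  Parameters
Bernoulli          $\mathsf{Ber}(p)$             $p^x(1-p)^{1-x}$                          $\{0, 1\}$               $0 \leq p \leq 1$
Binomial           $\mathsf{Bin}(n, p)$          $\binom{n}{x}p^x(1-p)^{n-x}$              $\{0, 1, \ldots, n\}$    $0 \leq p \leq 1,\ n \in \mathbb{N}$
Discrete uniform   $\mathsf{U}\{1, \ldots, n\}$  $\frac{1}{n}$                             $\{1, \ldots, n\}$       $n \in \{1, 2, \ldots\}$
Geometric          $\mathsf{Geom}(p)$            $p(1-p)^{x-1}$                            $\{1, 2, \ldots\}$       $0 \leq p \leq 1$
Poisson            $\mathsf{Poi}(\lambda)$       $e^{-\lambda}\frac{\lambda^x}{x!}$        $\mathbb{N}$             $\lambda > 0$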
C.3 Expectation


It is often useful to consider different kinds of numerical characteristics of a random variable. One such quantity is the expectation, which measures the "average" value of the distribution.

The expectation (or expected value or mean) of a random variable $X$ with pdf $f$, denoted by $\mathbb{E}X$ or² $\mathbb{E}[X]$ (and sometimes $\mu$), is defined by

$$\mathbb{E}X = \begin{cases} \sum_x x f(x) & \text{discrete case,} \\ \int_{-\infty}^{\infty} x f(x)\,dx & \text{continuous case.} \end{cases}$$

If $X$ is a random variable, then a function of $X$, such as $X^2$ or $\sin(X)$, is again a random variable. Moreover, the expected value of a function of $X$ is simply a weighted average of the possible values that this function can take. That is, for any real function $h$

$$\mathbb{E}h(X) = \begin{cases} \sum_x h(x) f(x) & \text{discrete case,} \\ \int_{-\infty}^{\infty} h(x) f(x)\,dx & \text{continuous case,} \end{cases}$$

provided that the sum or integral is well-defined.

The variance of a random variable $X$, denoted by $\mathbb{V}\mathrm{ar}\,X$ (and sometimes $\sigma^2$), is defined by

$$\mathbb{V}\mathrm{ar}\,X = \mathbb{E}(X - \mathbb{E}[X])^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2.$$


The square root of the variance is called the standard deviation. Table C.3 lists the expect- 
ations and variances for some well-known distributions. Both variance and standard devi- 
ation measure the spread or dispersion of the distribution. Note, however, that the standard 
deviation measures the dispersion in the same units as the random variable, unlike the 
variance, which uses squared units. 
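The definitions of expectation and variance in the discrete case translate directly into weighted sums. The sketch below does this for an assumed $\mathsf{Bin}(10, 0.3)$ distribution; the helper names `binomial_pmf` and `expectation` are hypothetical.

```python
from math import comb

def binomial_pmf(n, p):
    """Return the Bin(n, p) pdf as a dictionary {x: f(x)}."""
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def expectation(pmf, h=lambda x: x):
    """E h(X) = sum over x of h(x) f(x)."""
    return sum(h(x) * fx for x, fx in pmf.items())

pmf = binomial_pmf(10, 0.3)
mean = expectation(pmf)                                     # E X
var = expectation(pmf, lambda x: (x - mean)**2)             # E (X - E X)^2
var_alt = expectation(pmf, lambda x: x**2) - mean**2        # E X^2 - (E X)^2
std = var ** 0.5                                            # same units as X
```

Up to rounding error, the two variance formulas agree, and the results match the closed-form values $np = 3$ and $np(1-p) = 2.1$ for the binomial distribution.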


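Table C.3: Expectations and variances for some well-known distributions.

Dist.                               $\mathbb{E}X$                                        $\mathbb{V}\mathrm{ar}\,X$
$\mathsf{Bin}(n, p)$                $np$                                                 $np(1-p)$
$\mathsf{Geom}(p)$                  $\frac{1}{p}$                                        $\frac{1-p}{p^2}$
$\mathsf{Poi}(\lambda)$             $\lambda$                                            $\lambda$
$\mathsf{U}[\alpha, \beta]$         $\frac{\alpha+\beta}{2}$                             $\frac{(\beta-\alpha)^2}{12}$
$\mathsf{Exp}(\lambda)$             $\frac{1}{\lambda}$                                  $\frac{1}{\lambda^2}$
$\mathsf{t}_\nu$                    $0$ $(\nu > 1)$                                      $\frac{\nu}{\nu-2}$ $(\nu > 2)$
$\mathsf{Gamma}(\alpha, \lambda)$   $\frac{\alpha}{\lambda}$                             $\frac{\alpha}{\lambda^2}$
$\mathcal{N}(\mu, \sigma^2)$        $\mu$                                                $\sigma^2$
$\mathsf{Beta}(\alpha, \beta)$      $\frac{\alpha}{\alpha+\beta}$                        $\frac{\alpha\beta}{(\alpha+\beta)^2(1+\alpha+\beta)}$
$\mathsf{Weib}(\alpha, \lambda)$    $\frac{\Gamma(1/\alpha)}{\alpha\lambda}$             $\frac{2\Gamma(2/\alpha)}{\alpha\lambda^2} - \left(\frac{\Gamma(1/\alpha)}{\alpha\lambda}\right)^2$
$\mathsf{F}(m, n)$                  $\frac{n}{n-2}$ $(n > 2)$                            $\frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$ $(n > 4)$

²We only use brackets in an expectation if it is unclear with respect to which random variable the expectation is taken.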
It is sometimes useful to consider the moment generating function of a random variable $X$. This is the function $M$ defined by

$$M(s) = \mathbb{E}\,e^{sX}, \quad s \in \mathbb{R}. \tag{C.5}$$

The moment generating functions of two random variables coincide if and only if the random variables have the same distribution; see also Theorem C.12.


Example C.1 (Moment Generating Function of the $\mathsf{Gamma}(\alpha, \lambda)$ Distribution) Let $X \sim \mathsf{Gamma}(\alpha, \lambda)$. For $s < \lambda$, the moment generating function of $X$ at $s$ is given by

$$M(s) = \mathbb{E}\,e^{sX} = \int_0^\infty e^{sx}\,\frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}\,dx = \left(\frac{\lambda}{\lambda - s}\right)^{\alpha}\int_0^\infty \underbrace{\frac{(\lambda - s)^\alpha x^{\alpha-1} e^{-(\lambda - s)x}}{\Gamma(\alpha)}}_{\text{pdf of } \mathsf{Gamma}(\alpha,\,\lambda - s)}\,dx = \left(\frac{\lambda}{\lambda - s}\right)^{\alpha}.$$

For $s \geq \lambda$, $M(s) = \infty$. Interestingly, the moment generating function has a much simpler formula than the pdf.
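The closed form of the moment generating function can be checked numerically by integrating $e^{sx} f(x)$ directly. The sketch below does so with a simple midpoint rule; the parameter values $\alpha = 3$, $\lambda = 2$, $s = 0.5$, the truncation point `upper`, and the number of subintervals `n` are assumptions chosen for illustration.

```python
from math import exp, gamma

def gamma_pdf(x, alpha, lam):
    """Pdf of the Gamma(alpha, lam) distribution at x > 0."""
    return lam**alpha * x**(alpha - 1) * exp(-lam * x) / gamma(alpha)

def mgf_numeric(s, alpha, lam, upper=50.0, n=100_000):
    """Midpoint-rule approximation of E exp(sX) = int_0^inf exp(sx) f(x) dx."""
    h = upper / n
    midpoints = ((k + 0.5) * h for k in range(n))
    return h * sum(exp(s * x) * gamma_pdf(x, alpha, lam) for x in midpoints)

alpha, lam, s = 3.0, 2.0, 0.5
numeric = mgf_numeric(s, alpha, lam)
closed_form = (lam / (lam - s))**alpha        # (lambda / (lambda - s))^alpha
```

Because $s < \lambda$, the integrand decays exponentially and truncating the integral at a moderate upper limit loses a negligible amount of mass.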


C.4 Joint Distributions 


Distributions for random vectors and stochastic processes can be specified in much the same way as for random variables. In particular, the distribution of a random vector $X = [X_1, \ldots, X_n]^\top$ is completely determined by specifying the joint cdf $F$, defined by

$$F(x_1, \ldots, x_n) = \mathbb{P}[X_1 \leq x_1, \ldots, X_n \leq x_n], \quad x_i \in \mathbb{R},\ i = 1, \ldots, n.$$

Similarly, the distribution of a stochastic process, that is, a collection of random variables $\{X_t, t \in \mathcal{T}\}$, for some index set $\mathcal{T}$, is completely determined by its finite-dimensional distributions; specifically, the distributions of the random vectors $[X_{t_1}, \ldots, X_{t_n}]^\top$ for every choice of $n$ and $t_1, \ldots, t_n$.

By analogy to the one-dimensional case, a random vector $X = [X_1, \ldots, X_n]^\top$ taking values in $\mathbb{R}^n$ is said to have a pdf $f$ if, in the continuous case,

$$\mathbb{P}[X \in B] = \int_B f(x)\,dx, \tag{C.6}$$

for all $n$-dimensional rectangles $B$. Replace the integral with a sum for the discrete case. The pdf is also called the joint pdf of $X_1, \ldots, X_n$. The pdfs of the individual components (called marginal pdfs) can be recovered from the joint pdf by "integrating out the other variables". For example, for a continuous random vector $[X, Y]^\top$ with pdf $f$, the pdf $f_X$ of $X$ is given by

$$f_X(x) = \int f(x, y)\,dy.$$
C.5 Conditioning and Independence


Conditional probabilities and conditional distributions are used to model additional information on a random experiment. Independence is used to model lack of such information.


C.5.1 Conditional Probability


Suppose some event $B \subseteq \Omega$ occurs. Given this fact, event $A$ will occur if and only if $A \cap B$ occurs, and the relative chance of $A$ occurring is therefore $\mathbb{P}[A \cap B]/\mathbb{P}[B]$, provided $\mathbb{P}[B] > 0$. This leads to the definition of the conditional probability of $A$ given $B$:

$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}, \quad \text{if } \mathbb{P}[B] > 0. \tag{C.7}$$

The above definition breaks down if $\mathbb{P}[B] = 0$. Such conditional probabilities must be treated with more care [11].

Three important consequences of the definition of conditional probability are:

1. Product rule: For any sequence of events $A_1, A_2, \ldots, A_n$,

$$\mathbb{P}[A_1 \cdots A_n] = \mathbb{P}[A_1]\,\mathbb{P}[A_2 \mid A_1]\,\mathbb{P}[A_3 \mid A_1 A_2] \cdots \mathbb{P}[A_n \mid A_1 \cdots A_{n-1}], \tag{C.8}$$

using the abbreviation $A_1 A_2 \cdots A_k := A_1 \cap A_2 \cap \cdots \cap A_k$.

2. Law of total probability: If $\{B_i\}$ forms a partition of $\Omega$ (that is, $B_i \cap B_j = \emptyset$, $i \neq j$, and $\cup_i B_i = \Omega$), then for any event $A$

$$\mathbb{P}[A] = \sum_i \mathbb{P}[A \mid B_i]\,\mathbb{P}[B_i]. \tag{C.9}$$

3. Bayes' rule: Let $\{B_i\}$ form a partition of $\Omega$. Then, for any event $A$ with $\mathbb{P}[A] > 0$,

$$\mathbb{P}[B_j \mid A] = \frac{\mathbb{P}[A \mid B_j]\,\mathbb{P}[B_j]}{\sum_i \mathbb{P}[A \mid B_i]\,\mathbb{P}[B_i]}. \tag{C.10}$$
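The law of total probability and Bayes' rule can be illustrated with a small computation. The numbers below describe a hypothetical diagnostic test and are assumptions made purely for illustration: the partition is $B_1$ (condition present, prior probability 1/100) and $B_2$ (condition absent), and $A$ is the event of a positive test result.

```python
from fractions import Fraction

# Assumed prior probabilities P[B_i] for the partition {B_1, B_2}.
prior = [Fraction(1, 100), Fraction(99, 100)]
# Assumed conditional probabilities P[A | B_i] of a positive test result.
likelihood = [Fraction(95, 100), Fraction(5, 100)]

# Law of total probability (C.9): P[A] = sum_i P[A | B_i] P[B_i].
p_a = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule (C.10): P[B_j | A] = P[A | B_j] P[B_j] / P[A].
posterior = [l * p / p_a for l, p in zip(likelihood, prior)]
```

With these assumed numbers, $\mathbb{P}[A] = 59/1000$ and $\mathbb{P}[B_1 \mid A] = 19/118 \approx 0.16$, so a positive result raises the probability of the condition from 1% to about 16%.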





C.5.2 Independence 


Two events $A$ and $B$ are said to be independent if the knowledge that $B$ has occurred does not change the probability that $A$ occurs. That is, $A, B$ independent $\Leftrightarrow$ $\mathbb{P}[A \mid B] = \mathbb{P}[A]$. Since $\mathbb{P}[A \mid B]\,\mathbb{P}[B] = \mathbb{P}[A \cap B]$, an alternative definition of independence is

$$A, B \text{ independent} \Leftrightarrow \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B].$$

This definition covers the case where $\mathbb{P}[B] = 0$ and can be extended to arbitrarily many events: events $A_1, A_2, \ldots$ are said to be (mutually) independent if for any $k$ and any choice of distinct indices $i_1, \ldots, i_k$,

$$\mathbb{P}[A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}] = \mathbb{P}[A_{i_1}]\,\mathbb{P}[A_{i_2}] \cdots \mathbb{P}[A_{i_k}].$$

The concept of independence can also be formulated for random variables. Random variables $X_1, X_2, \ldots$ are said to be independent if the events $\{X_{i_1} \leq x_{i_1}\}, \ldots, \{X_{i_n} \leq x_{i_n}\}$ are independent for all finite choices of $n$ distinct indices $i_1, \ldots, i_n$ and values $x_{i_1}, \ldots, x_{i_n}$.


An important characterization of independent random variables is the following (for a 
proof, see [101], for example). 


Theorem C.1: Independence Characterization 





Many probabilistic models involve random variables X1, X2,... that are independent 
and identically distributed, abbreviated as iid. We use this abbreviation throughout this 
book. 


C.5.3 Expectation and Covariance 


Similar to the univariate case, the expected value of a real-valued function $h$ of a random
vector $\mathbf{X} \sim f$ is a weighted average of all values that $h(\mathbf{X})$ can take. Specifically, in the
continuous case, $\mathbb{E} h(\mathbf{X}) = \int h(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}$. In the discrete case replace this multidimensional
integral with a sum. Using this result, it is not difficult to show that for any collection of
dependent or independent random variables $X_1, \ldots, X_n$,

$$\mathbb{E}[a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n] = a + b_1\, \mathbb{E} X_1 + \cdots + b_n\, \mathbb{E} X_n \tag{C.12}$$

for all constants $a, b_1, \ldots, b_n$. Moreover, for independent random variables,

$$\mathbb{E}[X_1 X_2 \cdots X_n] = \mathbb{E} X_1\, \mathbb{E} X_2 \cdots \mathbb{E} X_n. \tag{C.13}$$

We leave the proofs as an exercise.

The covariance of two random variables $X$ and $Y$ with expectations $\mu_X$ and $\mu_Y$, respectively,
is defined as

$$\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)].$$


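This is a measure of the amount of linear dependency between the variables. Let $\sigma_X^2 = \operatorname{Var} X$
and $\sigma_Y^2 = \operatorname{Var} Y$. A scaled version of the covariance is given by the correlation
coefficient,

$$\varrho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\, \sigma_Y}.$$

The following properties follow directly from the definitions of variance and covariance.

1. $\operatorname{Var} X = \mathbb{E} X^2 - \mu_X^2$.
2. $\operatorname{Var}[aX + b] = a^2 \sigma_X^2$.
3. $\operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mu_X \mu_Y$.
4. $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$.
5. $-\sigma_X \sigma_Y \le \operatorname{Cov}(X, Y) \le \sigma_X \sigma_Y$.
6. $\operatorname{Cov}(aX + bY, Z) = a \operatorname{Cov}(X, Z) + b \operatorname{Cov}(Y, Z)$.
7. $\operatorname{Cov}(X, X) = \sigma_X^2$.
8. $\operatorname{Var}[X + Y] = \sigma_X^2 + \sigma_Y^2 + 2 \operatorname{Cov}(X, Y)$.
9. If $X$ and $Y$ are independent, then $\operatorname{Cov}(X, Y) = 0$.

As a consequence of Properties 2 and 8 we have that for any sequence of independent
random variables $X_1, \ldots, X_n$ with variances $\sigma_1^2, \ldots, \sigma_n^2$,

$$\operatorname{Var}[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2 \tag{C.14}$$

for any choice of constants $a_1, \ldots, a_n$.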
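
Several of the listed properties can be verified numerically on a small discrete distribution. The sketch below assumes a hypothetical joint pmf for $(X, Y)$ (the table `pmf` is made up for illustration) and checks Properties 3, 5, and 8 by direct weighted sums.

```python
import numpy as np

# Hypothetical joint pmf of (X, Y): rows index x-values, columns index y-values.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([-1.0, 1.0])
pmf = np.array([[0.10, 0.20],
                [0.25, 0.15],
                [0.05, 0.25]])

def expect(h):
    """E h(X, Y), computed as a probability-weighted sum over the joint pmf."""
    xx, yy = np.meshgrid(x_vals, y_vals, indexing="ij")
    return float(np.sum(h(xx, yy) * pmf))

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

cov_alt = expect(lambda x, y: x * y) - mu_x * mu_y         # Property 3
var_sum = expect(lambda x, y: (x + y - mu_x - mu_y) ** 2)  # Var[X + Y], for Property 8
bound = np.sqrt(var_x * var_y)                             # sigma_X sigma_Y, for Property 5

print(cov_xy, cov_alt, var_sum, var_x + var_y + 2 * cov_xy)
```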
For random column vectors, such as $\mathbf{X} = [X_1, \ldots, X_n]^\top$, it is convenient to write the
expectations and covariances in vector and matrix notation. For a random vector $\mathbf{X}$ we
define its expectation vector as the vector of expectations

$$\boldsymbol{\mu} = [\mu_1, \ldots, \mu_n]^\top = [\mathbb{E} X_1, \ldots, \mathbb{E} X_n]^\top.$$

The covariance matrix $\boldsymbol{\Sigma}$ is defined as the matrix whose $(i, j)$-th element is

$$\operatorname{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)].$$

If we define the expectation of a vector (matrix) to be the vector (matrix) of expectations,
then we can compactly write

$$\boldsymbol{\mu} = \mathbb{E} \mathbf{X}$$

and

$$\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]. \tag{C.15}$$

A useful application of the cyclic property of the trace of a matrix (see Theorem A.1) is
the following.


Theorem C.2: Expectation of a Quadratic Form 





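Proof: Since $Y$ is a scalar, it is equal to its trace. Now, using the cyclic property:
$\mathbb{E} Y = \mathbb{E} \operatorname{tr}(Y) = \mathbb{E} \operatorname{tr}(\mathbf{X}^\top \mathbf{A} \mathbf{X}) = \mathbb{E} \operatorname{tr}(\mathbf{A} \mathbf{X} \mathbf{X}^\top) = \operatorname{tr}(\mathbf{A}\, \mathbb{E}[\mathbf{X} \mathbf{X}^\top]) = \operatorname{tr}(\mathbf{A}(\boldsymbol{\Sigma} + \boldsymbol{\mu} \boldsymbol{\mu}^\top))$
$= \operatorname{tr}(\mathbf{A} \boldsymbol{\Sigma}) + \operatorname{tr}(\mathbf{A} \boldsymbol{\mu} \boldsymbol{\mu}^\top) = \operatorname{tr}(\mathbf{A} \boldsymbol{\Sigma}) + \boldsymbol{\mu}^\top \mathbf{A} \boldsymbol{\mu}$. □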
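
The identity derived in the proof, $\mathbb{E}[\mathbf{X}^\top \mathbf{A} \mathbf{X}] = \operatorname{tr}(\mathbf{A} \boldsymbol{\Sigma}) + \boldsymbol{\mu}^\top \mathbf{A} \boldsymbol{\mu}$, can be checked exactly on a finite distribution. This is a small sketch under the assumption of a hypothetical four-point distribution in $\mathbb{R}^2$; the support points, probabilities, and matrix `A` are invented for illustration.

```python
import numpy as np

# Hypothetical discrete random vector X in R^2: four support points with probabilities.
points = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [2.0, 2.0]])
probs = np.array([0.1, 0.4, 0.3, 0.2])
A = np.array([[2.0, 1.0], [0.0, 3.0]])   # any fixed matrix; need not be symmetric

mu = probs @ points                                  # expectation vector
centered = points - mu
Sigma = (centered * probs[:, None]).T @ centered     # covariance matrix

# Left-hand side: E[X^T A X] as a probability-weighted sum over the support.
quad_forms = np.einsum("ij,jk,ik->i", points, A, points)
lhs = float(probs @ quad_forms)

# Right-hand side: tr(A Sigma) + mu^T A mu.
rhs = float(np.trace(A @ Sigma) + mu @ A @ mu)
print(lhs, rhs)
```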


C.5.4 Conditional Density and Conditional Expectation 


Suppose $X$ and $Y$ are both discrete or both continuous, with joint pdf $f$, and suppose $f_X(x) > 0$.
Then, the conditional pdf of $Y$ given $X = x$ is given by

$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} \quad \text{for all } y. \tag{C.16}$$

In the discrete case, the formula is a direct translation of (C.7), with $f_{Y|X}(y \mid x) = \mathbb{P}[Y = y \mid X = x]$.
In the continuous case, a similar interpretation in terms of densities can be used;
see, for example, [101, Page 221]. The corresponding distribution is called the conditional
distribution of $Y$ given $X = x$. Note that (C.16) implies that

$$f(x, y) = f_X(x)\, f_{Y|X}(y \mid x).$$


This is useful when the marginal and conditional pdfs are given, rather than the joint one. 
More generally, for the n-dimensional case we have 


$$f(x_1, \ldots, x_n) = f_{X_1}(x_1)\, f_{X_2|X_1}(x_2 \mid x_1) \cdots f_{X_n|X_1, \ldots, X_{n-1}}(x_n \mid x_1, \ldots, x_{n-1}), \tag{C.17}$$


which is in essence a rephrasing of the product rule (C.8) in terms of probability densities. 


As a conditional pdf has all the properties of an ordinary pdf, we may define expectations
with respect to it. The conditional expectation of a random variable $Y$ given $X = x$ is
defined as

$$\mathbb{E}[Y \mid X = x] = \begin{cases} \sum_y y\, f_{Y|X}(y \mid x) & \text{discrete case}, \\ \int y\, f_{Y|X}(y \mid x)\, dy & \text{continuous case}. \end{cases} \tag{C.18}$$

Note that $\mathbb{E}[Y \mid X = x]$ is a function of $x$. The corresponding random variable is written
as $\mathbb{E}[Y \mid X]$. A similar formalism can be used when conditioning on a sequence of random
variables $X_1, \ldots, X_n$. The conditional expectation has similar properties to the ordinary
expectation. Other useful properties (see, for example, [127]) are:

1. Tower property: If $\mathbb{E} Y$ exists, then

$$\mathbb{E}\, \mathbb{E}[Y \mid X] = \mathbb{E} Y. \tag{C.19}$$

2. Taking out what is known: If $\mathbb{E} Y$ exists, then

$$\mathbb{E}[XY \mid X] = X\, \mathbb{E}[Y \mid X].$$
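
Both properties can be illustrated in the discrete case, where (C.16) and (C.18) reduce to row operations on a table of joint probabilities. The following is a minimal sketch, assuming a hypothetical 3-by-3 joint pmf (the numbers in `joint` are invented for illustration).

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([-1.0, 0.0, 3.0])
# Hypothetical joint pmf: joint[i, j] = P[X = x_i, Y = y_j].
joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.10, 0.05],
                  [0.05, 0.20, 0.10]])

f_x = joint.sum(axis=1)                   # marginal pmf of X
cond_y_given_x = joint / f_x[:, None]     # (C.16): row i is f_{Y|X}(. | x_i)
cond_exp = cond_y_given_x @ y_vals        # (C.18): E[Y | X = x_i] for each i

tower_lhs = float(f_x @ cond_exp)         # E E[Y | X]
e_y = float(joint.sum(axis=0) @ y_vals)   # E Y from the marginal pmf of Y

# Taking out what is known: E[XY | X = x_i] should equal x_i * E[Y | X = x_i].
xy = x_vals[:, None] * y_vals[None, :]
cond_exp_xy = (cond_y_given_x * xy).sum(axis=1)
print(tower_lhs, e_y)
```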


C.6 Functions of Random Variables 


Let $\mathbf{x} = [x_1, \ldots, x_n]^\top$ be a column vector in $\mathbb{R}^n$ and $\mathbf{A}$ an $m \times n$ matrix. The mapping $\mathbf{x} \mapsto \mathbf{z}$,
with $\mathbf{z} = \mathbf{A}\mathbf{x}$, is a linear transformation, as discussed in Section A.1. Now consider a
random vector $\mathbf{X} = [X_1, \ldots, X_n]^\top$ and let $\mathbf{Z} := \mathbf{A}\mathbf{X}$. Then $\mathbf{Z}$ is a random vector in $\mathbb{R}^m$. The
following theorem details how the distribution of $\mathbf{Z}$ is related to that of $\mathbf{X}$.

Theorem C.3: Linear Transformation





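Proof: We have $\boldsymbol{\mu}_Z = \mathbb{E} \mathbf{Z} = \mathbb{E}[\mathbf{A}\mathbf{X}] = \mathbf{A}\, \mathbb{E} \mathbf{X} = \mathbf{A} \boldsymbol{\mu}_X$ and

$$\begin{aligned} \boldsymbol{\Sigma}_Z &= \mathbb{E}[(\mathbf{Z} - \boldsymbol{\mu}_Z)(\mathbf{Z} - \boldsymbol{\mu}_Z)^\top] = \mathbb{E}[\mathbf{A}(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{A}(\mathbf{X} - \boldsymbol{\mu}_X))^\top] \\ &= \mathbf{A}\, \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{X} - \boldsymbol{\mu}_X)^\top]\, \mathbf{A}^\top \\ &= \mathbf{A} \boldsymbol{\Sigma}_X \mathbf{A}^\top. \end{aligned}$$

For $\mathbf{A}$ invertible and $\mathbf{X}$ continuous (as opposed to discrete), let $\mathbf{z} = \mathbf{A}\mathbf{x}$ and $\mathbf{x} = \mathbf{A}^{-1}\mathbf{z}$.
Consider the $n$-dimensional cube $C = [z_1, z_1 + h] \times \cdots \times [z_n, z_n + h]$. Then,

$$\mathbb{P}[\mathbf{Z} \in C] \approx h^n f_Z(\mathbf{z}),$$

by definition of the joint density of $\mathbf{Z}$. Let $D$ be the image of $C$ under $\mathbf{A}^{-1}$; that is, all
points $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} \in C$. Recall from Section A.1 that any matrix $\mathbf{B}$ linearly transforms an
$n$-dimensional rectangle with volume $V$ into an $n$-dimensional parallelepiped with volume
$V\, |\det(\mathbf{B})|$. Thus, in addition to the above expression for $\mathbb{P}[\mathbf{Z} \in C]$, we also have

$$\mathbb{P}[\mathbf{Z} \in C] = \mathbb{P}[\mathbf{X} \in D] \approx h^n |\det(\mathbf{A}^{-1})|\, f_X(\mathbf{x}) = h^n |\det(\mathbf{A})|^{-1} f_X(\mathbf{x}).$$

Equating these two expressions for $\mathbb{P}[\mathbf{Z} \in C]$, dividing both sides by $h^n$, and letting $h$ go to
0, we obtain (C.22). □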
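
The mean and covariance relations in the first part of the proof hold for any distribution, including the empirical distribution of a data set, which makes them easy to check numerically. The sketch below assumes a hypothetical 2-by-3 matrix `A` and an arbitrary (here exponential) sample; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(size=(3, 1000))       # 1000 draws of a 3-dimensional vector (as columns)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])          # hypothetical 2 x 3 matrix
Z = A @ X                                 # Z = A X applied to every draw

mu_x, mu_z = X.mean(axis=1), Z.mean(axis=1)
Sigma_x, Sigma_z = np.cov(X), np.cov(Z)   # rows are variables, columns are observations

# The sample versions satisfy mu_Z = A mu_X and Sigma_Z = A Sigma_X A^T up to rounding.
print(np.max(np.abs(mu_z - A @ mu_x)))
print(np.max(np.abs(Sigma_z - A @ Sigma_x @ A.T)))
```

Because the identities are algebraic, the agreement is exact up to floating-point error and does not depend on the sample size.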


For a generalization of the linear transformation rule (C.22), consider an arbitrary mapping
$\mathbf{x} \mapsto \mathbf{g}(\mathbf{x})$, written out:

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \mapsto \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_n(\mathbf{x}) \end{bmatrix}.$$

Theorem C.4: Transformation Rule





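Proof: For a fixed $\mathbf{x}$, let $\mathbf{z} = \mathbf{g}(\mathbf{x})$; and thus $\mathbf{x} = \mathbf{g}^{-1}(\mathbf{z})$. In the neighborhood of $\mathbf{x}$, the
function $\mathbf{g}$ behaves like a linear function, in the sense that $\mathbf{g}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{g}(\mathbf{x}) + \mathbf{J}_g(\mathbf{x})\, \boldsymbol{\delta}$ for
small vectors $\boldsymbol{\delta}$; see also Section B.1. Consequently, an infinitesimally small $n$-dimensional
rectangle at $\mathbf{x}$ with volume $V$ is transformed into an infinitesimally small $n$-dimensional
parallelepiped at $\mathbf{z}$ with volume $V\, |\det(\mathbf{J}_g(\mathbf{x}))|$. Now, as in the proof of the linear case, let
$C$ be a small cube around $\mathbf{z} = \mathbf{g}(\mathbf{x})$ with volume $h^n$. Let $D$ be the image of $C$ under $\mathbf{g}^{-1}$.
Then,

$$h^n f_Z(\mathbf{z}) \approx \mathbb{P}[\mathbf{Z} \in C] \approx h^n |\det(\mathbf{J}_{g^{-1}}(\mathbf{z}))|\, f_X(\mathbf{x}),$$

and since $|\det(\mathbf{J}_{g^{-1}}(\mathbf{z}))| = 1 / |\det(\mathbf{J}_g(\mathbf{x}))|$, (C.23) follows as $h$ goes to 0. □

Typically, in coordinate transformations it is $\mathbf{g}^{-1}$ that is given; that is, an expression
for $\mathbf{x}$ as a function of $\mathbf{z}$.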





Example C.2 (Polar Transform) Suppose $X, Y$ are independent and have standard normal
distribution. The joint pdf is

$$f_{X,Y}(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}, \quad (x, y) \in \mathbb{R}^2.$$

In polar coordinates we have

$$X = R \cos \Theta \quad \text{and} \quad Y = R \sin \Theta, \tag{C.24}$$

where $R \ge 0$ is the radius and $\Theta \in [0, 2\pi)$ the angle of the point $(X, Y)$. What is the joint pdf
of $R$ and $\Theta$? By the radial symmetry of the bivariate normal distribution, we would expect
$\Theta$ to be uniform on $(0, 2\pi)$. But what is the pdf of $R$? To work out the joint pdf, consider
the inverse transformation $\mathbf{g}^{-1}$, defined by

$$\begin{bmatrix} r \\ \theta \end{bmatrix} \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} r \cos \theta \\ r \sin \theta \end{bmatrix}.$$

The corresponding matrix of Jacobi is

$$\mathbf{J}_{g^{-1}}(r, \theta) = \begin{bmatrix} \cos \theta & -r \sin \theta \\ \sin \theta & r \cos \theta \end{bmatrix},$$

which has determinant $r$. Since $x^2 + y^2 = r^2(\cos^2 \theta + \sin^2 \theta) = r^2$, it follows by the
transformation rule (C.23) that the joint pdf of $R$ and $\Theta$ is given by

$$f_{R,\Theta}(r, \theta) = f_{X,Y}(x, y)\, r = \frac{1}{2\pi}\, r\, e^{-r^2/2}, \quad \theta \in (0, 2\pi),\ r \ge 0.$$

By integrating out $\theta$ and $r$, respectively, we find $f_R(r) = r\, e^{-r^2/2}$ and $f_\Theta(\theta) = 1/(2\pi)$. Since
$f_{R,\Theta}$ is the product of $f_R$ and $f_\Theta$, the random variables $R$ and $\Theta$ are independent. ■
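
The conclusions of the polar transform example can be probed by simulation. The sketch below is a Monte Carlo illustration only, not a proof; the sample size and seed are arbitrary assumptions. It uses the facts that $\int_0^\infty r \cdot r\, e^{-r^2/2}\, dr = \sqrt{\pi/2}$ and that a uniform angle on $(0, 2\pi)$ has mean $\pi$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                                   # arbitrary sample size
x = rng.standard_normal(n)
y = rng.standard_normal(n)

r = np.hypot(x, y)                            # radius R
theta = np.mod(np.arctan2(y, x), 2 * np.pi)   # angle Theta mapped to [0, 2*pi)

mean_r = r.mean()                   # theory: sqrt(pi / 2) under f_R(r) = r exp(-r^2 / 2)
mean_theta = theta.mean()           # theory: pi for the uniform distribution on (0, 2*pi)
corr = np.corrcoef(r, theta)[0, 1]  # near zero; consistent with (but weaker than) independence

print(mean_r, np.sqrt(np.pi / 2), mean_theta, corr)
```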


C.7 Multivariate Normal Distribution 


The normal (or Gaussian) distribution, especially its multidimensional version, plays
a central role in data science and machine learning. Recall from Table C.1 that a random
variable $X$ is said to have a normal distribution with parameters $\mu$ and $\sigma^2$ if its pdf is given
by

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in \mathbb{R}. \tag{C.25}$$

We write $X \sim \mathcal{N}(\mu, \sigma^2)$. The parameters $\mu$ and $\sigma^2$ are the expectation and variance of the
distribution, respectively. If $\mu = 0$ and $\sigma = 1$ then

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},$$

and the distribution is known as the standard normal distribution. The cdf of the standard
normal distribution is often denoted by $\Phi$ and its pdf by $\varphi$. In Figure C.5 the pdf of the
$\mathcal{N}(\mu, \sigma^2)$ distribution for various $\mu$ and $\sigma^2$ is plotted.


Figure C.5: The pdf of the $\mathcal{N}(\mu, \sigma^2)$ distribution for various $\mu$ and $\sigma^2$.


We next consider some important properties of the normal distribution. 


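Theorem C.5: Standardization

Proof: The cdf of $Z$ is given by

$$\begin{aligned} \mathbb{P}[Z \le z] &= \mathbb{P}[(X - \mu)/\sigma \le z] = \mathbb{P}[X \le \mu + \sigma z] \\ &= \int_{-\infty}^{\mu + \sigma z} \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy = \Phi(z), \end{aligned}$$

where we make a change of variable $y = (x - \mu)/\sigma$ in the fourth equation. Hence, $Z \sim \mathcal{N}(0, 1)$. □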


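The rescaling procedure in Theorem C.5 is called standardization. It follows from
Theorem C.5 that any $X \sim \mathcal{N}(\mu, \sigma^2)$ can be written as

$$X = \mu + \sigma Z, \quad \text{where } Z \sim \mathcal{N}(0, 1).$$

In other words, any normal random variable can be viewed as an affine transformation
(that is, a linear transformation plus a constant) of a standard normal random variable.

We now generalize this to $n$ dimensions. Let $Z_1, \ldots, Z_n$ be independent and standard
normal random variables. The joint pdf of $\mathbf{Z} = [Z_1, \ldots, Z_n]^\top$ is given by

$$f_Z(\mathbf{z}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z_i^2} = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2} \mathbf{z}^\top \mathbf{z}}, \quad \mathbf{z} \in \mathbb{R}^n. \tag{C.26}$$

We write $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix. Consider the affine transformation

$$\mathbf{X} = \boldsymbol{\mu} + \mathbf{B}\mathbf{Z} \tag{C.27}$$

for some $m \times n$ matrix $\mathbf{B}$ and $m$-dimensional vector $\boldsymbol{\mu}$. Note that, by (C.20) and (C.21), $\mathbf{X}$
has expectation vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma} = \mathbf{B}\mathbf{B}^\top$. We say that $\mathbf{X}$ has a multivariate
normal or multivariate Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.
We write $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.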
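
The affine construction (C.27) also gives a practical way to draw multivariate normal samples. The following sketch assumes that $\mathbf{B}$ is taken to be the lower Cholesky factor of a chosen covariance matrix; this is one possible choice, since any $\mathbf{B}$ with $\mathbf{B}\mathbf{B}^\top = \boldsymbol{\Sigma}$ would do. The values of `mu` and `Sigma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])                # hypothetical mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])            # hypothetical positive definite covariance matrix

B = np.linalg.cholesky(Sigma)             # lower triangular, with B @ B.T equal to Sigma
Z = rng.standard_normal((2, 100_000))     # columns are iid N(0, I) vectors
X = mu[:, None] + B @ Z                   # (C.27) applied column by column

sample_mean = X.mean(axis=1)
sample_cov = np.cov(X)
print(sample_mean)
print(sample_cov)
```

The sample mean and covariance should be close to `mu` and `Sigma`, with a Monte Carlo error that shrinks as the number of columns grows.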

The following theorem states that any affine combination of independent multivariate 
normal random variables is again multivariate normal. 


Theorem C.6: Affine Transformation of Normal Random Vectors 





Proof: Denote the $n$-dimensional random vector in the left-hand side of (C.28) by $\mathbf{Y}$. By
definition, each $\mathbf{X}_i$ can be written as $\boldsymbol{\mu}_i + \mathbf{A}_i \mathbf{Z}_i$, where the $\{\mathbf{Z}_i\}$ are independent (because the
$\{\mathbf{X}_i\}$ are independent), so that

$$\mathbf{Y} = \mathbf{a} + \sum_i \mathbf{B}_i (\boldsymbol{\mu}_i + \mathbf{A}_i \mathbf{Z}_i) = \mathbf{a} + \sum_i \mathbf{B}_i \boldsymbol{\mu}_i + \sum_i \mathbf{B}_i \mathbf{A}_i \mathbf{Z}_i,$$

which is an affine combination of independent standard normal random vectors. Hence, $\mathbf{Y}$
is multivariate normal. Its expectation vector and covariance matrix can be found easily
from Theorem C.3. □


The next theorem shows that the distribution of a subvector of a multivariate normal 
random vector is again normal. 


Theorem C.7: Marginal Distributions of Normal Random Vectors 





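Proof: We give a proof assuming that $\boldsymbol{\Sigma}$ is positive definite. Let $\mathbf{B}\mathbf{B}^\top$ be the (lower)
Cholesky decomposition of $\boldsymbol{\Sigma}$. We can write

$$\begin{bmatrix} \mathbf{X}_p \\ \mathbf{X}_q \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}_p \\ \boldsymbol{\mu}_q \end{bmatrix} + \underbrace{\begin{bmatrix} \mathbf{B}_p & \mathbf{O} \\ \mathbf{C}_r & \mathbf{C}_q \end{bmatrix}}_{\mathbf{B}} \begin{bmatrix} \mathbf{Z}_p \\ \mathbf{Z}_q \end{bmatrix}, \tag{C.30}$$

where $\mathbf{Z}_p$ and $\mathbf{Z}_q$ are independent $p$- and $q$-dimensional standard normal random vectors.
In particular, $\mathbf{X}_p = \boldsymbol{\mu}_p + \mathbf{B}_p \mathbf{Z}_p$, which means that $\mathbf{X}_p \sim \mathcal{N}(\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$, since $\mathbf{B}_p \mathbf{B}_p^\top = \boldsymbol{\Sigma}_p$. □

By relabeling the elements of $\mathbf{X}$ we see that Theorem C.7 implies that any subvector of
$\mathbf{X}$ has a multivariate normal distribution. For example, $\mathbf{X}_q \sim \mathcal{N}(\boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$.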

The following theorem shows that not only the marginal distributions of a normal ran- 
dom vector are normal, but also its conditional distributions. 


Theorem C.8: Conditional Distributions of Normal Random Vectors 





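Proof: From (C.30) we see that $\mathbf{X}_p = \boldsymbol{\mu}_p + \mathbf{B}_p \mathbf{Z}_p$ and $\mathbf{X}_q = \boldsymbol{\mu}_q + \mathbf{C}_r \mathbf{Z}_p + \mathbf{C}_q \mathbf{Z}_q$. Consequently,

$$(\mathbf{X}_q \mid \mathbf{X}_p = \mathbf{x}_p) = \boldsymbol{\mu}_q + \mathbf{C}_r \mathbf{B}_p^{-1}(\mathbf{x}_p - \boldsymbol{\mu}_p) + \mathbf{C}_q \mathbf{Z}_q,$$

where $\mathbf{Z}_q$ is a $q$-dimensional multivariate standard normal random vector. It follows that
$\mathbf{X}_q$ conditional on $\mathbf{X}_p = \mathbf{x}_p$ has a $\mathcal{N}(\boldsymbol{\mu}_q + \mathbf{C}_r \mathbf{B}_p^{-1}(\mathbf{x}_p - \boldsymbol{\mu}_p),\ \mathbf{C}_q \mathbf{C}_q^\top)$ distribution. The proof of
(C.31) is completed by observing that $\boldsymbol{\Sigma}_r^\top \boldsymbol{\Sigma}_p^{-1} = \mathbf{C}_r \mathbf{B}_p^\top (\mathbf{B}_p^\top)^{-1} \mathbf{B}_p^{-1} = \mathbf{C}_r \mathbf{B}_p^{-1}$, and

$$\boldsymbol{\Sigma}_q - \boldsymbol{\Sigma}_r^\top \boldsymbol{\Sigma}_p^{-1} \boldsymbol{\Sigma}_r = \mathbf{C}_r \mathbf{C}_r^\top + \mathbf{C}_q \mathbf{C}_q^\top - \mathbf{C}_r \mathbf{B}_p^{-1} \underbrace{\boldsymbol{\Sigma}_r}_{\mathbf{B}_p \mathbf{C}_r^\top} = \mathbf{C}_q \mathbf{C}_q^\top.$$

If $\mathbf{X}_p$ and $\mathbf{X}_q$ are independent, then they are obviously uncorrelated, as
$\boldsymbol{\Sigma}_r = \mathbb{E}[(\mathbf{X}_p - \boldsymbol{\mu}_p)(\mathbf{X}_q - \boldsymbol{\mu}_q)^\top] = \mathbb{E}(\mathbf{X}_p - \boldsymbol{\mu}_p)\, \mathbb{E}(\mathbf{X}_q - \boldsymbol{\mu}_q)^\top = \mathbf{O}$. Conversely, if $\boldsymbol{\Sigma}_r = \mathbf{O}$, then by (C.31) the
conditional distribution of $\mathbf{X}_q$ given $\mathbf{X}_p$ is the same as the unconditional distribution of $\mathbf{X}_q$;
that is, $\mathcal{N}(\boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$. In other words, $\mathbf{X}_q$ is independent of $\mathbf{X}_p$. □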
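
The two matrix identities used at the end of the proof can be confirmed numerically. The sketch below assumes the block partition of $\boldsymbol{\Sigma}$ that the proof implies (upper-left block $\boldsymbol{\Sigma}_p$, upper-right block $\boldsymbol{\Sigma}_r$, lower-right block $\boldsymbol{\Sigma}_q$); the particular `Sigma`, `mu`, and conditioning value `x_p` are hypothetical.

```python
import numpy as np

# Hypothetical positive definite covariance matrix and mean vector.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
mu = np.array([0.0, 1.0, -1.0])
p = 1                                      # X_p: first p components; X_q: the rest

B = np.linalg.cholesky(Sigma)              # lower Cholesky factor, as in (C.30)
B_p, C_r, C_q = B[:p, :p], B[p:, :p], B[p:, p:]
Sigma_p, Sigma_r, Sigma_q = Sigma[:p, :p], Sigma[:p, p:], Sigma[p:, p:]

x_p = np.array([1.5])                      # hypothetical observed value of X_p

# Conditional mean and covariance written with the Cholesky blocks.
mean_chol = mu[p:] + C_r @ np.linalg.solve(B_p, x_p - mu[:p])
cov_chol = C_q @ C_q.T

# The same quantities written with the covariance blocks.
mean_sig = mu[p:] + Sigma_r.T @ np.linalg.solve(Sigma_p, x_p - mu[:p])
cov_sig = Sigma_q - Sigma_r.T @ np.linalg.solve(Sigma_p, Sigma_r)

print(mean_chol, mean_sig)
print(cov_chol, cov_sig)
```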


The next few results are about the relationships between the normal, chi-squared,
Student, and F distributions, defined in Table C.1. Recall that the chi-squared family of
distributions, denoted by $\chi_n^2$, are simply $\mathrm{Gamma}(n/2, 1/2)$ distributions, where the
parameter $n \in \{1, 2, 3, \ldots\}$ is called the degrees of freedom.


Theorem C.9: Relationship Between Normal and $\chi^2$ Distributions





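Proof: Let $\mathbf{B}\mathbf{B}^\top$ be the Cholesky decomposition of $\boldsymbol{\Sigma}$, where $\mathbf{B}$ is invertible. Since $\mathbf{X}$ can
be written as $\boldsymbol{\mu} + \mathbf{B}\mathbf{Z}$, where $\mathbf{Z} = [Z_1, \ldots, Z_n]^\top$ is a vector of independent standard normal
random variables, we have

$$(\mathbf{X} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) = (\mathbf{X} - \boldsymbol{\mu})^\top (\mathbf{B}\mathbf{B}^\top)^{-1} (\mathbf{X} - \boldsymbol{\mu}) = \mathbf{Z}^\top \mathbf{Z} = \sum_{i=1}^n Z_i^2.$$

Using the independence of $Z_1, \ldots, Z_n$, the moment generating function of $Y = \sum_{i=1}^n Z_i^2$ is
given by

$$\mathbb{E}\, e^{sY} = \mathbb{E}\, e^{s(Z_1^2 + \cdots + Z_n^2)} = \mathbb{E}\left[e^{sZ_1^2} \cdots e^{sZ_n^2}\right] = \left(\mathbb{E}\, e^{sZ^2}\right)^n,$$

where $Z \sim \mathcal{N}(0, 1)$. The moment generating function of $Z^2$ is

$$\mathbb{E}\, e^{sZ^2} = \int_{-\infty}^{\infty} e^{sz^2} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(1 - 2s) z^2}\, dz = \frac{1}{\sqrt{1 - 2s}},$$

so that $\mathbb{E}\, e^{sY} = \left(\frac{1/2}{1/2 - s}\right)^{n/2}$, $s < 1/2$, which is the moment generating function of the
$\mathrm{Gamma}(n/2, 1/2)$ distribution; that is, the $\chi_n^2$ distribution (see Example C.1). The result
now follows from the uniqueness of the moment generating function. □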
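
The quadratic form treated in the proof can be simulated to see the $\chi_n^2$ behaviour. This is a Monte Carlo sketch under assumed values of `mu` and `Sigma`; it compares the sample mean and variance of $(\mathbf{X} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu})$ with the mean $n$ and variance $2n$ of the $\mathrm{Gamma}(n/2, 1/2)$ distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 3, 200_000                          # dimension and (arbitrary) number of draws
mu = np.array([1.0, 0.0, -1.0])            # hypothetical mean vector
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])        # hypothetical positive definite covariance

B = np.linalg.cholesky(Sigma)
X = mu[:, None] + B @ rng.standard_normal((n, N))   # columns are N(mu, Sigma) draws

D = X - mu[:, None]
# Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every column, without forming the inverse.
Y = np.sum(D * np.linalg.solve(Sigma, D), axis=0)

print(Y.mean(), Y.var())                   # compare with n and 2 * n
```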





A consequence of Theorem C.9 is that if $\mathbf{X} = [X_1, \ldots, X_n]^\top$ is $n$-dimensional standard
normal, then the squared length $\|\mathbf{X}\|^2 = X_1^2 + \cdots + X_n^2$ has a $\chi_n^2$ distribution. If instead
$X_i \sim \mathcal{N}(\mu_i, 1)$, $i = 1, \ldots, n$, then $\|\mathbf{X}\|^2$ is said to have a noncentral $\chi_n^2$ distribution. This distribution
depends on the $\{\mu_i\}$ only through the norm $\|\boldsymbol{\mu}\|$. We write $\|\mathbf{X}\|^2 \sim \chi_n^2(\theta)$, where $\theta = \|\boldsymbol{\mu}\|$ is
the noncentrality parameter.

Such distributions frequently occur when considering projections of multivariate nor- 
mal random variables, as summarized in the following theorem. 




Theorem C.10: Relationship Between Normal and Noncentral $\chi^2$ Distributions





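Proof: Let $\mathbf{v}_1, \ldots, \mathbf{v}_n$ be an orthonormal basis of $\mathbb{R}^n$ such that $\mathbf{v}_1, \ldots, \mathbf{v}_k$ spans $\mathcal{V}_k$ and
$\mathbf{v}_1, \ldots, \mathbf{v}_m$ spans $\mathcal{V}_m$. By (A.8) we can write the orthogonal projection matrices onto $\mathcal{V}_j$
as $\mathbf{P}_j = \sum_{i=1}^{j} \mathbf{v}_i \mathbf{v}_i^\top$, $j = k, m, n$, where $\mathcal{V}_n$ is defined as $\mathbb{R}^n$. Note that $\mathbf{P}_n$ is simply the
identity matrix. Let $\mathbf{V} := [\mathbf{v}_1, \ldots, \mathbf{v}_n]$ and define $\mathbf{Z} := [Z_1, \ldots, Z_n]^\top = \mathbf{V}^\top \mathbf{X}$. Recall from
Section A.2 that any orthogonal transformation such as $\mathbf{z} = \mathbf{V}^\top \mathbf{x}$ is length preserving; that is,

$$\|\mathbf{z}\| = \|\mathbf{x}\|.$$

To prove the first statement of the theorem, note that $\mathbf{V}^\top \mathbf{X}_j = \mathbf{V}^\top \mathbf{P}_j \mathbf{X} = [Z_1, \ldots, Z_j, 0, \ldots, 0]^\top$,
$j = k, m$. It follows that $\mathbf{V}^\top(\mathbf{X}_m - \mathbf{X}_k) = [0, \ldots, 0, Z_{k+1}, \ldots, Z_m, 0, \ldots, 0]^\top$ and
$\mathbf{V}^\top(\mathbf{X} - \mathbf{X}_m) = [0, \ldots, 0, Z_{m+1}, \ldots, Z_n]^\top$. Moreover, being a linear transformation of a normal
random vector, $\mathbf{Z}$ is also normal, with covariance matrix $\mathbf{V}^\top \mathbf{V} = \mathbf{I}_n$. In particular, the
$\{Z_i\}$ are independent. This shows that $\mathbf{X}_k$, $\mathbf{X}_m - \mathbf{X}_k$ and $\mathbf{X} - \mathbf{X}_m$ are independent as well.

Next, observe that $\|\mathbf{X}_k\| = \|\mathbf{V}^\top \mathbf{X}_k\| = \|\mathbf{Z}_k\|$, where $\mathbf{Z}_k := [Z_1, \ldots, Z_k]^\top$. The latter vector
has independent components with variances 1, and its squared norm has therefore (by
definition) a $\chi_k^2(\theta)$ distribution. The noncentrality parameter is $\theta = \|\mathbb{E} \mathbf{Z}_k\| = \|\mathbb{E} \mathbf{X}_k\| = \|\boldsymbol{\mu}_k\|$,
again by the length-preserving property of orthogonal transformations. This shows that
$\|\mathbf{X}_k\|^2 \sim \chi_k^2(\|\boldsymbol{\mu}_k\|)$. The distributions of $\|\mathbf{X}_m - \mathbf{X}_k\|^2$ and $\|\mathbf{X} - \mathbf{X}_m\|^2$ follow by analogy. □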


Theorem C.10 is frequently used in the statistical analysis of normal linear models; see
Section 5.4. In typical situations $\boldsymbol{\mu}$ lies in the subspace $\mathcal{V}_m$ or even $\mathcal{V}_k$, in which case
$\|\mathbf{X}_m - \mathbf{X}_k\|^2 \sim \chi_{m-k}^2$ and $\|\mathbf{X} - \mathbf{X}_m\|^2 \sim \chi_{n-m}^2$ independently. The (scaled) quotient then turns
out to have an F distribution, a consequence of the following theorem.


Theorem C.11: Relationship Between $\chi^2$ and F Distributions





Proof: For notational simplicity, let $c = m/2$ and $d = n/2$. The pdf of $W = U/V$ is
given by $f_W(w) = \int_0^\infty f_U(wv)\, v\, f_V(v)\, dv$. Substituting the pdfs of the corresponding Gamma
distributions, we have

$$\begin{aligned} f_W(w) &= \int_0^\infty \frac{(wv)^{c-1} e^{-wv/2}}{\Gamma(c)\, 2^c}\, v\, \frac{v^{d-1} e^{-v/2}}{\Gamma(d)\, 2^d}\, dv = \frac{w^{c-1}}{\Gamma(c)\, \Gamma(d)\, 2^{c+d}} \int_0^\infty v^{c+d-1} e^{-(1+w)v/2}\, dv \\ &= \frac{\Gamma(c+d)}{\Gamma(c)\, \Gamma(d)}\, \frac{w^{c-1}}{(1+w)^{c+d}}, \end{aligned}$$

where the last equality follows from the fact that the integrand is equal to $\Gamma(\alpha) \lambda^{-\alpha}$ times the
density of the $\mathrm{Gamma}(\alpha, \lambda)$ distribution with $\alpha = c + d$ and $\lambda = (1 + w)/2$. The density of
$Z = \frac{n}{m} \frac{U}{V}$ is given by

$$f_Z(z) = f_W(z\, m/n)\, m/n.$$

The proof is completed by comparing the resulting expression with the pdf of the F
distribution given in Table C.1. □


Corollary C.1 (Relationship Between Normal, $\chi^2$, and $t$ Distributions) Let $Z \sim \mathcal{N}(0, 1)$
and $V \sim \chi_n^2$ be independent. Then,

$$\frac{Z}{\sqrt{V/n}} \sim t_n.$$

Proof: Let $T = Z / \sqrt{V/n}$. Because $Z^2 \sim \chi_1^2$, we have by Theorem C.11 that $T^2 \sim \mathrm{F}(1, n)$.
The result follows now from the symmetry around 0 of the pdf of $T$ and the fact that the
square of a $t_n$ random variable has an $\mathrm{F}(1, n)$ distribution. □


C.8 Convergence of Random Variables 


Recall that a random variable $X$ is a function from $\Omega$ to $\mathbb{R}$. If we have a sequence of random
variables $X_1, X_2, \ldots$ (for instance, $X_n(\omega) = X(\omega) + \frac{1}{n}$ for each $\omega \in \Omega$), then one can consider
the pointwise convergence:

$$\lim_{n \to \infty} X_n(\omega) = X(\omega), \quad \text{for all } \omega \in \Omega,$$

in which case we say that $X_1, X_2, \ldots$ converges surely to $X$. A more interesting type of
convergence uses the probability measure $\mathbb{P}$ associated with $X$.


Definition C.1: Convergence in Probability 


The sequence of random variables $X_1, X_2, \ldots$ converges in probability to a random
variable $X$ if, for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \mathbb{P}[|X_n - X| > \varepsilon] = 0.$$

We denote the convergence in probability as $X_n \xrightarrow{\mathbb{P}} X$.

Convergence in probability refers only to the distribution of $X_n$. Instead, if the sequence
$X_1, X_2, \ldots$ is defined on a common probability space, then we can consider the following
mode of convergence that uses the joint distribution of the sequence of random variables.


Definition C.2: Almost Sure Convergence 
The sequence of random variables $X_1, X_2, \ldots$ converges almost surely to a random
variable $X$ if for every $\varepsilon > 0$

$$\lim_{n \to \infty} \mathbb{P}\Big[\sup_{k \ge n} |X_k - X| > \varepsilon\Big] = 0.$$

We denote the almost sure convergence as $X_n \xrightarrow{\text{a.s.}} X$.

Note that in accordance with these definitions $X_n \xrightarrow{\text{a.s.}} 0$ is equivalent to $\sup_{k \ge n} |X_k| \xrightarrow{\mathbb{P}} 0$.


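Example C.3 (Convergence in Probability Versus Almost Sure Convergence) Since
the event $\{|X_n - X| > \varepsilon\}$ is contained in $\{\sup_{k \ge n} |X_k - X| > \varepsilon\}$, we can conclude that almost
sure convergence implies convergence in probability. However, the converse is not true in
general. For instance, consider the sequence of independent random variables $X_1, X_2, \ldots$ with
marginal distribution

$$\mathbb{P}[X_n = 1] = 1 - \mathbb{P}[X_n = 0] = 1/n.$$

Clearly, $X_n \xrightarrow{\mathbb{P}} 0$. However, for $\varepsilon < 1$ and any $n = 1, 2, \ldots$ we have,

$$\begin{aligned} \mathbb{P}\Big[\sup_{k \ge n} |X_k| \le \varepsilon\Big] &= \mathbb{P}[X_n \le \varepsilon, X_{n+1} \le \varepsilon, \ldots] \\ &= \mathbb{P}[X_n \le \varepsilon] \times \mathbb{P}[X_{n+1} \le \varepsilon] \times \cdots \quad \text{(using independence)} \\ &= \lim_{m \to \infty} \prod_{k=n}^{m} \mathbb{P}[X_k \le \varepsilon] = \lim_{m \to \infty} \prod_{k=n}^{m} \Big(1 - \frac{1}{k}\Big) \\ &= \lim_{m \to \infty} \frac{n-1}{n} \times \frac{n}{n+1} \times \cdots \times \frac{m-1}{m} = 0. \end{aligned}$$

It follows that $\mathbb{P}[\sup_{k \ge n} |X_k - 0| > \varepsilon] = 1$ for any $0 < \varepsilon < 1$ and all $n \ge 1$. In other words, it
is not true that $X_n \xrightarrow{\text{a.s.}} 0$. ■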
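
The telescoping product in the example can be evaluated exactly for finite $m$. The sketch below (the function name `prob_all_zero` is hypothetical) computes $\prod_{k=n}^{m} (1 - 1/k)$ with rational arithmetic and shows that it equals $(n-1)/m$, which tends to 0 as $m$ grows.

```python
from fractions import Fraction

def prob_all_zero(n, m):
    """P[X_n = 0, ..., X_m = 0] = prod_{k=n}^{m} (1 - 1/k) for independent X_k."""
    p = Fraction(1)
    for k in range(n, m + 1):
        p *= 1 - Fraction(1, k)
    return p

n = 5
values = {m: prob_all_zero(n, m) for m in (10, 100, 1000)}
for m, p in values.items():
    # Each value coincides with the closed form (n - 1) / m.
    print(m, p, Fraction(n - 1, m))
```

So although each single $X_k$ is unlikely to equal 1 for large $k$, the chance that none of $X_n, \ldots, X_m$ equals 1 still vanishes as $m$ increases.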


Another important type of convergence is useful when we are interested in estimating 
expectations or multidimensional integrals via Monte Carlo methodology. 
Definition C.3: Convergence in Distribution 


The sequence of random variables $X_1, X_2, \ldots$ is said to converge in distribution to a
random variable $X$ with distribution function $F_X(x) = \mathbb{P}[X \le x]$ provided that:

$$\lim_{n \to \infty} \mathbb{P}[X_n \le x] = F_X(x) \quad \text{for all } x \text{ such that } \lim_{a \to x} F_X(a) = F_X(x). \tag{C.33}$$

We denote the convergence in distribution as either $X_n \xrightarrow{d} X$, or $X_n \xrightarrow{d} F_X$.

The generalization to random vectors replaces (C.33) with

$$\lim_{n \to \infty} \mathbb{P}[\mathbf{X}_n \in A] = \mathbb{P}[\mathbf{X} \in A] \quad \text{for all } A \subset \mathbb{R}^n \text{ such that } \mathbb{P}[\mathbf{X} \in \partial A] = 0, \tag{C.34}$$

where $\partial A$ denotes the boundary of the set $A$.

A useful tool for demonstrating convergence in distribution is the characteristic function
$\psi_X$ of a random vector $\mathbf{X}$, defined as the expectation:

$$\psi_X(\mathbf{t}) := \mathbb{E}\, e^{\mathrm{i}\, \mathbf{t}^\top \mathbf{X}}, \quad \mathbf{t} \in \mathbb{R}^n. \tag{C.35}$$

The moment generating function in (C.5) is a special case of the characteristic function
evaluated at $t = -\mathrm{i} s$. Note that while the moment generating function of a random variable
may not exist, its characteristic function always exists. The characteristic function of a
random vector $\mathbf{X} \sim f$ is closely related to the Fourier transform of its pdf $f$.
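Example C.4 (Characteristic Function of a Multivariate Gaussian Random Vector)
The density of the multivariate standard normal distribution is given in (C.26) and thus the
characteristic function of $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$ is

$$\begin{aligned} \psi_Z(\mathbf{t}) = \mathbb{E}\, e^{\mathrm{i}\, \mathbf{t}^\top \mathbf{Z}} &= (2\pi)^{-n/2} \int_{\mathbb{R}^n} e^{\mathrm{i}\, \mathbf{t}^\top \mathbf{z} - \frac{1}{2} \|\mathbf{z}\|^2}\, d\mathbf{z} \\ &= e^{-\|\mathbf{t}\|^2/2}\, (2\pi)^{-n/2} \int_{\mathbb{R}^n} e^{-\frac{1}{2} \|\mathbf{z} - \mathrm{i} \mathbf{t}\|^2}\, d\mathbf{z} = e^{-\|\mathbf{t}\|^2/2}, \quad \mathbf{t} \in \mathbb{R}^n. \end{aligned}$$

Hence, the characteristic function of the random vector $\mathbf{X} = \boldsymbol{\mu} + \mathbf{B}\mathbf{Z}$ in (C.27) with
multivariate normal distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by

$$\begin{aligned} \psi_X(\mathbf{t}) &= \mathbb{E}\, e^{\mathrm{i}\, \mathbf{t}^\top \mathbf{X}} = \mathbb{E}\, e^{\mathrm{i}\, \mathbf{t}^\top (\boldsymbol{\mu} + \mathbf{B}\mathbf{Z})} \\ &= e^{\mathrm{i}\, \mathbf{t}^\top \boldsymbol{\mu}}\, \mathbb{E}\, e^{\mathrm{i} (\mathbf{B}^\top \mathbf{t})^\top \mathbf{Z}} = e^{\mathrm{i}\, \mathbf{t}^\top \boldsymbol{\mu}}\, \psi_Z(\mathbf{B}^\top \mathbf{t}) \\ &= e^{\mathrm{i}\, \mathbf{t}^\top \boldsymbol{\mu} - \|\mathbf{B}^\top \mathbf{t}\|^2/2} = e^{\mathrm{i}\, \mathbf{t}^\top \boldsymbol{\mu} - \mathbf{t}^\top \boldsymbol{\Sigma} \mathbf{t}/2}. \end{aligned}$$

■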


The importance of the characteristic function is mainly derived from the following 
result, for which a proof can be found, for example, in [11]. 


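Theorem C.12: Characteristic Function

Example C.5 (Convergence in Distribution) Define the random variables $Y_1, Y_2, \ldots$ as

$$Y_n := \sum_{k=1}^{n} X_k \Big(\frac{1}{2}\Big)^k, \quad n = 1, 2, \ldots,$$

where $X_1, X_2, \ldots \overset{\text{iid}}{\sim} \mathrm{Ber}(1/2)$. We now show that $Y_n \xrightarrow{d} \mathcal{U}(0, 1)$. First, note that

$$\mathbb{E} \exp(\mathrm{i} t Y_n) = \prod_{k=1}^{n} \mathbb{E} \exp(\mathrm{i} t X_k / 2^k) = 2^{-n} \prod_{k=1}^{n} \big(1 + \exp(\mathrm{i} t / 2^k)\big).$$

Second, from the collapsing product, $(1 - \exp(\mathrm{i} t / 2^n)) \prod_{k=1}^{n} (1 + \exp(\mathrm{i} t / 2^k)) = 1 - \exp(\mathrm{i} t)$,
we have

$$\mathbb{E} \exp(\mathrm{i} t Y_n) = (1 - \exp(\mathrm{i} t))\, \frac{1/2^n}{1 - \exp(\mathrm{i} t / 2^n)}.$$

It follows that $\lim_{n \to \infty} \mathbb{E} \exp(\mathrm{i} t Y_n) = (\exp(\mathrm{i} t) - 1)/(\mathrm{i} t)$, which we recognize as the
characteristic function of the $\mathcal{U}(0, 1)$ distribution. ■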
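
The convergence of the characteristic functions in this example can be observed numerically. The following sketch evaluates the product formula for $\mathbb{E} \exp(\mathrm{i} t Y_n)$ and the limit $(\exp(\mathrm{i} t) - 1)/(\mathrm{i} t)$ at one arbitrarily chosen point $t = 3$; the function names are hypothetical.

```python
import numpy as np

def cf_Y(t, n):
    """Characteristic function of Y_n: 2^{-n} * prod_{k=1}^{n} (1 + exp(i t / 2^k))."""
    k = np.arange(1, n + 1)
    return np.prod((1 + np.exp(1j * t / 2.0 ** k)) / 2)

def cf_uniform(t):
    """Characteristic function of the U(0, 1) distribution, for t != 0."""
    return (np.exp(1j * t) - 1) / (1j * t)

t = 3.0                                   # arbitrary nonzero evaluation point
errors = [abs(cf_Y(t, n) - cf_uniform(t)) for n in (2, 5, 10, 30)]
print(errors)                             # the gap shrinks roughly like t / 2^(n+1)
```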


Yet another mode of convergence is the following. 


Definition C.4: Convergence in $L^p$-norm


The sequence of random variables $X_1, X_2, \ldots$ converges in $L^p$-norm to a random
variable $X$ if

$$\lim_{n \to \infty} \mathbb{E} |X_n - X|^p = 0, \quad p \ge 1.$$

We denote the convergence in $L^p$-norm as $X_n \xrightarrow{L^p} X$.





The case for $p = 2$ corresponds to convergence in mean squared error. The following
example illustrates that convergence in $L^p$-norm is qualitatively different from convergence
in distribution.

Example C.6 (Comparison of Modes of Convergence) Define $X_n := 1 - X$, where $X$
has a uniform distribution on the interval $(0, 1)$. Clearly, $X_n \xrightarrow{d} \mathcal{U}(0, 1)$. However,
$\mathbb{E} |X_n - X| = \mathbb{E} |1 - 2X| = 1/2$ and so the sequence does not converge in $L^1$-norm. In addition,
$\mathbb{P}[|X_n - X| > \varepsilon] = 1 - \varepsilon \ne 0$ and so $X_n$ does not converge in probability as well.
Thus, in general $X_n \xrightarrow{d} X$ implies neither $X_n \xrightarrow{\mathbb{P}} X$, nor $X_n \xrightarrow{L^1} X$.

We mention, however, that if $X_n \xrightarrow{d} c$ for some constant $c$, then $X_n \xrightarrow{\mathbb{P}} c$ as well. To
see this, note that $X_n \xrightarrow{d} c$ stands for

$$\lim_{n \to \infty} \mathbb{P}[X_n \le x] = \begin{cases} 1, & x > c, \\ 0, & x < c. \end{cases}$$

In other words, we can write:

$$\mathbb{P}[|X_n - c| > \varepsilon] \le 1 - \mathbb{P}[X_n \le c + \varepsilon] + \mathbb{P}[X_n \le c - \varepsilon] \to 1 - 1 + 0 = 0, \quad n \to \infty,$$


which shows that $X_n \xrightarrow{\mathbb{P}} c$ by definition. ■

Definition C.5: Complete Convergence


The sequence of random variables $X_1, X_2, \ldots$ is said to converge completely to $X$ if
for all $\varepsilon > 0$

$$\sum_n \mathbb{P}[|X_n - X| > \varepsilon] < \infty.$$

We denote the complete convergence as $X_n \xrightarrow{\text{cpl.}} X$.





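Example C.7 (Complete and Almost Sure Convergence) We show that complete
convergence implies almost sure convergence. We can bound the criterion for almost sure
convergence as follows:

$$\begin{aligned} \mathbb{P}\Big[\sup_{k \ge n} |X_k - X| > \varepsilon\Big] &= \mathbb{P}\big[\cup_{k \ge n} \{|X_k - X| > \varepsilon\}\big] \\ &\le \sum_{k \ge n} \mathbb{P}[|X_k - X| > \varepsilon] \quad \text{by union bound in (C.1)} \\ &\le \underbrace{\sum_{k=1}^{\infty} \mathbb{P}[|X_k - X| > \varepsilon]}_{=\, c\, <\, \infty \text{ from } X_n \xrightarrow{\text{cpl.}} X} - \sum_{k=1}^{n-1} \mathbb{P}[|X_k - X| > \varepsilon] \\ &\le c - \sum_{k=1}^{n-1} \mathbb{P}[|X_k - X| > \varepsilon] \to c - c = 0, \quad n \to \infty. \end{aligned}$$

Hence, by definition $X_n \xrightarrow{\text{a.s.}} X$. ■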


The next theorem shows how the different types of convergence are related to each
other. For example, in the diagram below, the notation $\overset{p \ge q}{\Rightarrow}$ means that $L^p$-norm convergence
implies $L^q$-norm convergence under the assumption that $p \ge q \ge 1$.


Theorem C.13: Modes of Convergence

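Proof: 1. First, we show that $X_n \xrightarrow{\mathbb{P}} X \Rightarrow X_n \xrightarrow{d} X$ using the inequality $\mathbb{P}[A \cap B] \le \mathbb{P}[A]$
for any event $B$. To this end, consider the distribution function $F_X$ of $X$:

$$\begin{aligned} F_{X_n}(x) = \mathbb{P}[X_n \le x] &= \mathbb{P}[X_n \le x, |X_n - X| > \varepsilon] + \mathbb{P}[X_n \le x, |X_n - X| \le \varepsilon] \\ &\le \mathbb{P}[|X_n - X| > \varepsilon] + \mathbb{P}[X_n \le x, X \le X_n + \varepsilon] \\ &\le \mathbb{P}[|X_n - X| > \varepsilon] + \mathbb{P}[X \le x + \varepsilon]. \end{aligned}$$

Now, in the arguments above we can switch the roles of $X_n$ and $X$ (there is a symmetry) to
deduce the analogous result: $F_X(x) \le \mathbb{P}[|X - X_n| > \varepsilon] + \mathbb{P}[X_n \le x + \varepsilon]$. Therefore, making
the switch $x \to x - \varepsilon$ gives $F_X(x - \varepsilon) \le \mathbb{P}[|X - X_n| > \varepsilon] + F_{X_n}(x)$. Putting it all together
gives:

$$F_X(x - \varepsilon) - \mathbb{P}[|X - X_n| > \varepsilon] \le F_{X_n}(x) \le \mathbb{P}[|X_n - X| > \varepsilon] + F_X(x + \varepsilon).$$

Taking $n \to \infty$ on both sides yields for any $\varepsilon > 0$:

$$F_X(x - \varepsilon) \le \lim_{n \to \infty} F_{X_n}(x) \le F_X(x + \varepsilon).$$

Since $F_X$ is continuous at $x$ by assumption we can take $\varepsilon \downarrow 0$ to conclude that
$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$.

2. Second, we show that $X_n \xrightarrow{L^p} X \Rightarrow X_n \xrightarrow{L^q} X$ for $p \ge q \ge 1$. Since the function
$f(x) = x^{q/p}$ is concave for $q/p \le 1$, Jensen’s inequality yields:

$$(\mathbb{E} |X|^p)^{q/p} = f(\mathbb{E} |X|^p) \ge \mathbb{E} f(|X|^p) = \mathbb{E} |X|^q.$$

In other words, $(\mathbb{E} |X_n - X|^q)^{1/q} \le (\mathbb{E} |X_n - X|^p)^{1/p} \to 0$, proving the statement of the theorem.

3. Third, we show that $X_n \xrightarrow{L^1} X \Rightarrow X_n \xrightarrow{\mathbb{P}} X$. First note that for any random variable
$Y$, we can write: $\mathbb{E} |Y| \ge \mathbb{E}[|Y|\, \mathbb{1}_{\{|Y| > \varepsilon\}}] \ge \mathbb{E}[|\varepsilon|\, \mathbb{1}_{\{|Y| > \varepsilon\}}] = \varepsilon\, \mathbb{P}[|Y| > \varepsilon]$. Therefore, we obtain
Chebyshev’s inequality:

$$\mathbb{P}[|Y| > \varepsilon] \le \frac{\mathbb{E} |Y|}{\varepsilon}. \tag{C.36}$$

Using Chebyshev’s inequality and $X_n \xrightarrow{L^1} X$, we can write

$$\mathbb{P}[|X_n - X| > \varepsilon] \le \frac{\mathbb{E} |X_n - X|}{\varepsilon} \to 0, \quad n \to \infty.$$

Hence, by definition $X_n \xrightarrow{\mathbb{P}} X$.

4. Finally, $X_n \xrightarrow{\text{cpl.}} X \Rightarrow X_n \xrightarrow{\text{a.s.}} X \Rightarrow X_n \xrightarrow{\mathbb{P}} X$ is proved in Examples C.7 and C.3. □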


Finally, we will make use of the following theorem. 


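Theorem C.14: Slutsky

Proof: We prove the theorem for scalar $X$ and $Y$. The proof for random vectors is analogous.
First, we show that $\mathbf{Z}_n := [X_n, Y_n]^\top \xrightarrow{d} [X, c]^\top =: \mathbf{Z}$ using, for example, Theorem C.12. In
other words, we wish to show that the characteristic function of the joint distribution of $X_n$
and $Y_n$ converges pointwise as $n \to \infty$:

$$\psi_{X_n, Y_n}(\mathbf{t}) = \mathbb{E}\, e^{\mathrm{i}(t_1 X_n + t_2 Y_n)} \to e^{\mathrm{i} t_2 c}\, \mathbb{E}\, e^{\mathrm{i} t_1 X} = \psi_{X, c}(\mathbf{t}), \quad \forall \mathbf{t} \in \mathbb{R}^2.$$

To show the limit above, consider

$$\begin{aligned} |\psi_{X_n, Y_n}(\mathbf{t}) - \psi_{X, c}(\mathbf{t})| &\le |\psi_{X_n, c}(\mathbf{t}) - \psi_{X, c}(\mathbf{t})| + |\psi_{X_n, Y_n}(\mathbf{t}) - \psi_{X_n, c}(\mathbf{t})| \\ &= \big|e^{\mathrm{i} t_2 c}\, \mathbb{E}(e^{\mathrm{i} t_1 X_n} - e^{\mathrm{i} t_1 X})\big| + \big|\mathbb{E}\, e^{\mathrm{i}(t_1 X_n + t_2 c)} (e^{\mathrm{i} t_2 (Y_n - c)} - 1)\big| \\ &\le |e^{\mathrm{i} t_2 c}| \times \big|\mathbb{E}(e^{\mathrm{i} t_1 X_n} - e^{\mathrm{i} t_1 X})\big| + \mathbb{E}\big[|e^{\mathrm{i}(t_1 X_n + t_2 c)}| \times |e^{\mathrm{i} t_2 (Y_n - c)} - 1|\big] \\ &= |\psi_{X_n}(t_1) - \psi_X(t_1)| + \mathbb{E} |e^{\mathrm{i} t_2 (Y_n - c)} - 1|. \end{aligned}$$

Since $X_n \xrightarrow{d} X$, Theorem C.12 implies that $\psi_{X_n}(t_1) \to \psi_X(t_1)$, and the first term
$|\psi_{X_n}(t_1) - \psi_X(t_1)|$ goes to zero. For the second term we use the fact that

$$|e^{\mathrm{i} x} - 1| = \Big|\int_0^x \mathrm{i}\, e^{\mathrm{i} \theta}\, d\theta\Big| \le \Big|\int_0^x |\mathrm{i}\, e^{\mathrm{i} \theta}|\, d\theta\Big| = |x|, \quad x \in \mathbb{R}$$

to obtain the bound:

$$\begin{aligned} \mathbb{E} |e^{\mathrm{i} t_2 (Y_n - c)} - 1| &= \mathbb{E} |e^{\mathrm{i} t_2 (Y_n - c)} - 1|\, \mathbb{1}_{\{|Y_n - c| > \varepsilon\}} + \mathbb{E} |e^{\mathrm{i} t_2 (Y_n - c)} - 1|\, \mathbb{1}_{\{|Y_n - c| \le \varepsilon\}} \\ &\le 2\, \mathbb{E}\, \mathbb{1}_{\{|Y_n - c| > \varepsilon\}} + \mathbb{E} |t_2 (Y_n - c)|\, \mathbb{1}_{\{|Y_n - c| \le \varepsilon\}} \\ &\le 2\, \mathbb{P}[|Y_n - c| > \varepsilon] + |t_2|\, \varepsilon \to |t_2|\, \varepsilon, \quad n \to \infty. \end{aligned}$$

Since $\varepsilon$ is arbitrary, we can let $\varepsilon \downarrow 0$ to conclude that $\lim_{n \to \infty} |\psi_{X_n, Y_n}(\mathbf{t}) - \psi_{X, c}(\mathbf{t})| = 0$. In other
words, $\mathbf{Z}_n \xrightarrow{d} \mathbf{Z}$, and by the continuity of $g$, we have $g(\mathbf{Z}_n) \xrightarrow{d} g(\mathbf{Z})$ or $g(X_n, Y_n) \xrightarrow{d} g(X, c)$. □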


Example C.8 (Necessity of Slutsky’s Condition) The condition that $Y_n$ converges in
probability to a constant cannot be relaxed. For example, suppose that $g(x, y) = x + y$,
$X_n \xrightarrow{d} X \sim \mathcal{N}(0, 1)$ and $Y_n \xrightarrow{d} Y \sim \mathcal{N}(0, 1)$. Then, our intuition tempts us to incorrectly
conclude that $X_n + Y_n \xrightarrow{d} \mathcal{N}(0, 2)$. This intuition is false, because we can have $Y_n = -X_n$
for all $n$ so that $X_n + Y_n = 0$, while both $X$ and $Y$ have the same marginal distribution (in
this case standard normal). ■


C.9 Law of Large Numbers and Central Limit Theorem 


Two main results in probability are the law of large numbers and the central limit theorem.
Both are limit theorems involving sums of independent random variables. In particular,
consider a sequence $X_1, X_2, \ldots$ of iid random variables with finite expectation $\mu$ and finite
variance $\sigma^2$. For each $n$ define $\bar{X}_n := (X_1 + \cdots + X_n)/n$. What can we say about the (random)
sequence of averages $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$? By (C.12) and (C.14) we have $\mathbb{E} \bar{X}_n = \mu$ and $\operatorname{Var} \bar{X}_n = \sigma^2/n$.
Hence, as $n$ increases, the variance of the (random) average $\bar{X}_n$ goes to 0. This means
that by Definition C.4, the average $\bar{X}_n$ converges to $\mu$ in $L^2$-norm as $n \to \infty$, that is, $\bar{X}_n \xrightarrow{L^2} \mu$.

In fact, to obtain convergence in probability the variance need not be finite; it is
sufficient to assume that $\mu = \mathbb{E} X < \infty$.


Theorem C.15: Weak Law of Large Numbers 





The theorem has a natural generalization for random vectors. Namely, if $\boldsymbol{\mu} = \mathbb{E} \mathbf{X} < \infty$,
then $\mathbb{P}[\|\bar{\mathbf{X}}_n - \boldsymbol{\mu}\| > \varepsilon] \to 0$, where $\|\cdot\|$ is the Euclidean norm. We give a proof in the scalar
case.

Proof: Let $Z_k := X_k - \mu$ for all $k$, so that $\mathbb{E} Z = 0$. We thus need to show that $\bar{Z}_n \xrightarrow{\mathbb{P}} 0$.
We use the properties of the characteristic function of $Z$ denoted as $\psi_Z$. Due to the iid
assumption, we have

$$\psi_{\bar{Z}_n}(t) = \mathbb{E}\, e^{\mathrm{i} t \bar{Z}_n} = \mathbb{E} \prod_{i=1}^{n} e^{\mathrm{i} t Z_i / n} = \prod_{i=1}^{n} \mathbb{E}\, e^{\mathrm{i} Z_i t / n} = \prod_{i=1}^{n} \psi_Z(t/n) = [\psi_Z(t/n)]^n. \tag{C.37}$$

An application of Taylor’s Theorem B.1 in the neighborhood of $t = 0$ yields

$$\psi_Z(t/n) = \psi_Z(0) + o(t/n).$$

Since $\psi_Z(0) = 1$, we have:

$$\psi_{\bar{Z}_n}(t) = [\psi_Z(t/n)]^n = [1 + o(1/n)]^n \to 1, \quad n \to \infty.$$

The characteristic function of a random variable that always equals zero is 1. Therefore,
Theorem C.12 implies that $\bar{Z}_n \xrightarrow{d} 0$. However, according to Example C.6, convergence in
distribution to a constant implies convergence in probability. Hence, $\bar{Z}_n \xrightarrow{\mathbb{P}} 0$. □


There is also a stronger version of this theorem, as follows. 


Theorem C.16: Strong Law of Large Numbers

Proof: First, note that any random variable $X$ can be written as the difference of two
nonnegative random variables: $X = X_+ - X_-$, where $X_+ := \max\{X, 0\}$ and $X_- := -\min\{X, 0\}$.
Thus, without loss of generality, we assume that the random variables in the theorem above
are nonnegative.

Second, from the sequence $\{\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots\}$ we can pick up the subsequence
$\{\bar{X}_1, \bar{X}_4, \bar{X}_9, \bar{X}_{16}, \ldots\} =: \{\bar{X}_{j^2}\}$. Then, from Chebyshev’s inequality (C.36) and the iid condition, we have

$$\sum_{j=1}^{\infty} \mathbb{P}\big[|\bar{X}_{j^2} - \mu| > \varepsilon\big] \le \frac{1}{\varepsilon^2} \sum_{j=1}^{\infty} \frac{\operatorname{Var} X}{j^2} < \infty.$$

Therefore, by definition $\bar{X}_{n^2} \xrightarrow{\text{cpl.}} \mu$ and from Theorem C.13 we conclude that $\bar{X}_{n^2} \xrightarrow{\text{a.s.}} \mu$.

Third, for any arbitrary $n$, we can find a $k$, say $k = \lfloor \sqrt{n} \rfloor$, so that $k^2 \le n \le (k+1)^2$. For
such a $k$ and nonnegative $X_1, X_2, \ldots$, it holds that

$$\frac{k^2}{(k+1)^2}\, \bar{X}_{k^2} \le \bar{X}_n \le \bar{X}_{(k+1)^2}\, \frac{(k+1)^2}{k^2}.$$

Since $\bar{X}_{k^2}$ and $\bar{X}_{(k+1)^2}$ converge almost surely to $\mu$ as $k$ (and hence $n$) goes to infinity, we
conclude that $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$. □

Note that the condition $\mathbb{E} X^2 < \infty$ in Theorem C.16 can be weakened to $\mathbb{E} |X| < \infty$ and
the iid condition on the variables $X_1, \ldots, X_n$ can be relaxed to mere pairwise independence.
The corresponding proof, however, is significantly more difficult.
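
The behaviour described by the laws of large numbers can be seen along a single simulated sample path. The sketch below assumes iid $\mathrm{Exp}(1)$ draws (so $\mu = 1$); the distribution, seed, and sample size are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.exponential(scale=1.0, size=n)             # iid Exp(1) draws, so mu = 1

# running_mean[m - 1] is the average of the first m observations.
running_mean = np.cumsum(x) / np.arange(1, n + 1)

for m in (10, 1_000, 200_000):
    print(m, running_mean[m - 1])                  # drifts toward mu = 1 as m grows
```

A single path cannot prove almost sure convergence, but it shows the averages settling near $\mu$ with fluctuations of order $\sigma/\sqrt{m}$.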

The Central Limit Theorem describes the approximate distribution of $\bar{X}_n$, and it applies
to both continuous and discrete random variables. Loosely, it states that


the average of a large number of iid random variables
approximately has a normal distribution.


Specifically, the random variable $\bar{X}_n$ has a distribution that is approximately normal, with
expectation $\mu$ and variance $\sigma^2/n$.


Theorem C.17: Central Limit Theorem 





Proof: Let $Z_k := (X_k - \mu)/\sigma$ for all $k$, so that $\mathbb{E} Z = 0$ and $\mathbb{E} Z^2 = 1$. We thus need to show
that $\sqrt{n}\, \bar{Z}_n \xrightarrow{d} \mathcal{N}(0, 1)$. We again use the properties of the characteristic function. Let $\psi_Z$
be the characteristic function of an iid copy of $Z$, then due to the iid assumption a similar
calculation to the one in (C.37) yields:

$$\psi_{\sqrt{n} \bar{Z}_n}(t) = \mathbb{E}\, e^{\mathrm{i} t \sqrt{n}\, \bar{Z}_n} = [\psi_Z(t/\sqrt{n})]^n.$$

An application of Taylor’s Theorem B.1 in the neighborhood of $t = 0$ yields

$$\psi_Z(t/\sqrt{n}) = 1 + \frac{t}{\sqrt{n}}\, \psi_Z'(0) + \frac{t^2}{2n}\, \psi_Z''(0) + o(t^2/n).$$

Since $\psi_Z'(0) = \mathbb{E}\, \frac{d}{dt} e^{\mathrm{i} t Z} \big|_{t=0} = \mathrm{i}\, \mathbb{E} Z = 0$ and $\psi_Z''(0) = \mathrm{i}^2\, \mathbb{E} Z^2 = -1$, we have:

$$\psi_{\sqrt{n} \bar{Z}_n}(t) = \big[\psi_Z(t/\sqrt{n})\big]^n = \Big[1 - \frac{t^2}{2n} + o(1/n)\Big]^n \to e^{-t^2/2}, \quad n \to \infty.$$

From Example C.4, we recognize $e^{-t^2/2}$ as the characteristic function of the standard normal
distribution. Thus, from Theorem C.12 we conclude that $\sqrt{n}\, \bar{Z}_n \xrightarrow{d} \mathcal{N}(0, 1)$. □


Figure C.6 shows the central limit theorem in action. The left part shows the pdfs of
$\overline{X}_1, 2\overline{X}_2, \ldots, 4\overline{X}_4$ for the case where the $\{X_i\}$ have a $\mathsf{U}[0, 1]$ distribution. The right part
shows the same for the $\mathsf{Exp}(1)$ distribution. In both cases, we clearly see convergence to a
bell-shaped curve, characteristic of the normal distribution.




Figure C.6: Illustration of the central limit theorem for (left) the uniform distribution and 
(right) the exponential distribution. 
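The convergence illustrated in Figure C.6 can also be checked numerically. The following is a minimal simulation sketch, not taken from the text: the helper name `standardized_means`, the sample size `n = 30`, the number of repetitions, and the seed are all illustrative assumptions. It standardizes sample means from the uniform and exponential distributions and reports their empirical mean, standard deviation, and the fraction falling within $\pm 1.96$; under the central limit theorem these should be close to 0, 1, and 0.95.

```python
import random
import statistics

def standardized_means(sampler, mu, sigma, n, reps, seed=0):
    """Return `reps` draws of sqrt(n) * (sample mean - mu) / sigma."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xbar = sum(sampler(rng) for _ in range(n)) / n
        out.append((xbar - mu) * n ** 0.5 / sigma)
    return out

# U[0, 1] has mean 1/2 and variance 1/12; Exp(1) has mean 1 and variance 1.
z_unif = standardized_means(lambda r: r.random(), 0.5, (1 / 12) ** 0.5, n=30, reps=5000)
z_exp = standardized_means(lambda r: r.expovariate(1.0), 1.0, 1.0, n=30, reps=5000)

for name, z in [("uniform", z_unif), ("exponential", z_exp)]:
    inside = sum(abs(v) <= 1.96 for v in z) / len(z)  # should be near 0.95
    print(name, round(statistics.mean(z), 3), round(statistics.stdev(z), 3), round(inside, 3))
```

Increasing `n` should bring the reported values closer to their normal-theory targets, more slowly for the skewed exponential case.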


The multivariate version of the central limit theorem is the basis for many asymptotic 
(in the size of the training set) results in machine learning and data science. 


Theorem C.18: Multivariate Central Limit Theorem 





One application is as follows. Suppose that a parameter of interest, $\theta^*$, is the unique
solution of the system of equations $\mathbb{E}\, \psi(X \,|\, \theta^*) = 0$, where $\psi$ is a vector-valued (or multi-
valued) function and the distribution of $X$ does not depend on $\theta$. An M-estimator of $\theta^*$,
denoted $\widehat{\theta}_n$, is the solution to the system of equations that results from approximating the
expectation with respect to $X$ using an average of $n$ iid copies of $X$:

$$\overline{\psi}_n(\theta) := \frac{1}{n} \sum_{i=1}^{n} \psi(X_i \,|\, \theta) = 0.$$

Thus, $\overline{\psi}_n(\widehat{\theta}_n) = 0$.
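As a concrete illustration of solving $\overline{\psi}_n(\theta) = 0$, here is a minimal sketch under stated assumptions. The estimating function $\psi(x \,|\, \theta) = 1/\theta - x$ (the score of an exponential model with rate $\theta$), the bisection bracket, and the names `psi`, `psi_bar`, and `m_estimate` are hypothetical choices for this example only. For this $\psi$ the root is available in closed form, $1/\overline{x}$, which gives a check on the numerical solution.

```python
import random

def psi(x, theta):
    """Hypothetical estimating function: the score of the Exp(theta) model."""
    return 1.0 / theta - x

def psi_bar(data, theta):
    """Sample average of psi over the data, approximating E psi(X | theta)."""
    return sum(psi(x, theta) for x in data) / len(data)

def m_estimate(data, lo=1e-6, hi=1e3, tol=1e-10):
    """Solve psi_bar(data, theta) = 0 by bisection on [lo, hi].

    Assumes psi_bar is decreasing in theta and changes sign on the bracket.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_bar(data, mid) > 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of (or at) mid
    return 0.5 * (lo + hi)

rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(10_000)]  # simulated data, true rate 2
theta_hat = m_estimate(data)
print(theta_hat, len(data) / sum(data))  # numerical root versus closed form 1 / mean
```

Bisection is used only because it is easy to verify; any root-finding method would serve for a scalar parameter.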


Theorem C.19: M-estimator 







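Proof: We give a proof under the simplifying assumption³ that $\widehat{\theta}_n$ is a unique root, that is,
for any $\theta$ and $\varepsilon$, there exists a $\delta > 0$ such that $\|\widehat{\theta}_n - \theta\| > \varepsilon$ implies that $\|\overline{\psi}_n(\theta)\| > \delta$.

First, we argue that $\widehat{\theta}_n \xrightarrow{\mathbb{P}} \theta^*$; that is, $\mathbb{P}[\|\widehat{\theta}_n - \theta^*\| > \varepsilon] \to 0$. From the multivariate
extension of Theorem C.15, we have that

$$\overline{\psi}_n(\theta^*) \xrightarrow{\mathbb{P}} \mathbb{E}\, \overline{\psi}_n(\theta^*) = \mathbb{E}\, \psi(X \,|\, \theta^*) = 0.$$

Therefore, using the uniqueness of $\widehat{\theta}_n$, we can show that $\widehat{\theta}_n \xrightarrow{\mathbb{P}} \theta^*$ via the bound:

$$\mathbb{P}\left[\|\widehat{\theta}_n - \theta^*\| > \varepsilon\right] \le \mathbb{P}\left[\|\overline{\psi}_n(\theta^*)\| > \delta\right] = \mathbb{P}\left[\|\overline{\psi}_n(\theta^*) - \mathbb{E}\, \overline{\psi}_n(\theta^*)\| > \delta\right] \to 0, \quad n \to \infty.$$

Second, we take a Taylor expansion of each component of the vector $\overline{\psi}_n(\widehat{\theta}_n)$ around $\theta^*$ to
obtain:

$$\overline{\psi}_n(\widehat{\theta}_n) = \overline{\psi}_n(\theta^*) + J_n(\theta')(\widehat{\theta}_n - \theta^*),$$

where $J_n(\theta)$ is the Jacobian of $\overline{\psi}_n$ at $\theta$, and $\theta'$ lies on the line segment joining $\widehat{\theta}_n$ and $\theta^*$.
Rearrange the last equation and multiply both sides by $\sqrt{n}\, A^{-1}$ to obtain:

$$-A^{-1} J_n(\theta') \sqrt{n}\, (\widehat{\theta}_n - \theta^*) = A^{-1} \sqrt{n}\, \overline{\psi}_n(\theta^*).$$

By the central limit theorem, $\sqrt{n}\, \overline{\psi}_n(\theta^*)$ converges in distribution to $\mathsf{N}(0, B)$. Therefore,

$$-A^{-1} J_n(\theta') \sqrt{n}\, (\widehat{\theta}_n - \theta^*) \xrightarrow{d} \mathsf{N}(0, A^{-1} B A^{-\top}).$$

Theorem C.15 (the weak law of large numbers) applied to the iid random matrices
$\{\frac{\partial}{\partial \theta} \psi(X_i \,|\, \theta)\}$ shows that

$$J_n(\theta) \xrightarrow{\mathbb{P}} \mathbb{E}\, \frac{\partial}{\partial \theta} \psi(X \,|\, \theta).$$

Moreover, since $\widehat{\theta}_n \xrightarrow{\mathbb{P}} \theta^*$ and $J_n$ is continuous in $\theta$, we have that $J_n(\theta') \xrightarrow{\mathbb{P}} -A$. Therefore,
by Slutsky's theorem, $-A^{-1} J_n(\theta') \sqrt{n}\, (\widehat{\theta}_n - \theta^*) - \sqrt{n}\, (\widehat{\theta}_n - \theta^*) \xrightarrow{\mathbb{P}} 0$. □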





³The result holds under far less stringent assumptions.


Finally, we mention Laplace’s approximation, which shows how integrals or expecta- 
tions behave under the normal distribution with a vanishingly small variance. 


Theorem C.20: Laplace’s Approximation 





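Proof: (Sketch for a bounded domain $\Theta$.) The left-hand side of (C.39) can be written as
the expectation with respect to the $\mathsf{N}(\theta_n, \Sigma_n/n)$ distribution:

$$\sqrt{|2\pi \Sigma_n|} \int_{\Theta} g(\theta)\, \frac{\exp\left(-\frac{n}{2} (\theta - \theta_n)^\top \Sigma_n^{-1} (\theta - \theta_n)\right)}{|2\pi \Sigma_n / n|^{1/2}}\, d\theta = \sqrt{|2\pi \Sigma_n|}\; \mathbb{E}\left[g(X_n)\, \mathbb{1}\{X_n \in \Theta\}\right],$$

where $X_n \sim \mathsf{N}(\theta_n, \Sigma_n/n)$. Let $Z \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$. Then, $\theta_n + \Sigma_n^{1/2} Z / \sqrt{n}$ has the same distribution
as $X_n$ and $(\theta_n + \Sigma_n^{1/2} Z / \sqrt{n}) \to \theta^*$ as $n \to \infty$. By continuity of $g(\theta)\, \mathbb{1}\{\theta \in \Theta\}$ in the interior
of $\Theta$, as $n \to \infty$:⁴

$$\mathbb{E}\left[g(X_n)\, \mathbb{1}\{X_n \in \Theta\}\right] = \mathbb{E}\left[g\left(\theta_n + \frac{\Sigma_n^{1/2} Z}{\sqrt{n}}\right) \mathbb{1}\left\{\theta_n + \frac{\Sigma_n^{1/2} Z}{\sqrt{n}} \in \Theta\right\}\right] \to g(\theta^*)\, \mathbb{1}\{\theta^* \in \Theta\}.$$

Since $\theta^*$ lies in the interior of $\Theta$, we have $\mathbb{1}\{\theta^* \in \Theta\} = 1$, completing the proof. □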


As an application of Theorem C.20 we can show the following. 


Theorem C.21: Approximation of Integrals 





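Proof: We only sketch the proof of (C.40). Let $H(\theta)$ be the Hessian matrix of $r$ at $\theta$. By
Taylor's theorem we can write

$$r(\theta) - r(\theta^*) = (\theta - \theta^*)^\top \underbrace{\frac{\partial r(\theta^*)}{\partial \theta}}_{=\, 0} + \frac{1}{2} (\theta - \theta^*)^\top H(\bar{\theta}) (\theta - \theta^*),$$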





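where $\bar{\theta}$ is a point that lies on the line segment joining $\theta^*$ and $\theta$. Since $\theta^*$ is a unique
global minimum, there must be a small enough neighborhood of $\theta^*$, say $\Theta$, such that $r$ is a
strictly (also known as strongly) convex function on $\Theta$. In other words, $H(\theta)$ is a positive
definite matrix for all $\theta \in \Theta$ and there exists a smallest positive eigenvalue $\lambda_1 > 0$ such
that $x^\top H(\theta) x \ge \lambda_1 \|x\|^2$ for all $x$. In addition, since the maximum eigenvalue of $H(\theta)$ is a
continuous function of $\theta \in \Theta$ and $\Theta$ is bounded, there must exist a constant $\lambda_2 > \lambda_1$ such
that $x^\top H(\theta) x \le \lambda_2 \|x\|^2$ for all $x$. In other words, denoting $r^* := r(\theta^*)$, we have the bounds:

$$-\frac{\lambda_2}{2} \|\theta - \theta^*\|^2 \le -(r(\theta) - r^*) \le -\frac{\lambda_1}{2} \|\theta - \theta^*\|^2, \quad \theta \in \Theta.$$

Therefore,

$$\mathrm{e}^{-n r^*} \int_{\Theta} g(\theta)\, \mathrm{e}^{-\frac{n \lambda_2}{2} \|\theta - \theta^*\|^2}\, d\theta \le \int_{\Theta} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta \le \mathrm{e}^{-n r^*} \int_{\Theta} g(\theta)\, \mathrm{e}^{-\frac{n \lambda_1}{2} \|\theta - \theta^*\|^2}\, d\theta.$$

An application of Theorem C.20 yields $\int_{\Theta} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta = O(\mathrm{e}^{-n r^*} / n^{p/2})$ and, more im-
portantly,

$$\ln \int_{\Theta} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta \simeq -n r^* - \frac{p}{2} \ln n.$$

Thus, the proof will be complete once we show that $\int_{\overline{\Theta}} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta$, with $\overline{\Theta} := \mathbb{R}^p \setminus \Theta$, is
asymptotically negligible compared to $\int_{\Theta} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta$. Since $\theta^*$ is a global minimum that
lies outside any neighborhood of $\overline{\Theta}$, there must exist a constant $c > 0$ such that $r(\theta) - r^* > c$
for all $\theta \in \overline{\Theta}$. Therefore,

$$\begin{aligned}
\int_{\overline{\Theta}} g(\theta)\, \mathrm{e}^{-n r(\theta)}\, d\theta &= \mathrm{e}^{-(n-1) r^*} \int_{\overline{\Theta}} g(\theta)\, \mathrm{e}^{-r(\theta)}\, \mathrm{e}^{-(n-1)(r(\theta) - r^*)}\, d\theta \\
&\le \mathrm{e}^{-(n-1) r^*} \int_{\overline{\Theta}} g(\theta)\, \mathrm{e}^{-r(\theta)}\, \mathrm{e}^{-(n-1) c}\, d\theta \\
&\le \mathrm{e}^{-(n-1)(r^* + c)} \int_{\mathbb{R}^p} g(\theta)\, \mathrm{e}^{-r(\theta)}\, d\theta = O(\mathrm{e}^{-n (r^* + c)}).
\end{aligned}$$

The last expression is of order $o(\mathrm{e}^{-n r^*} / n^{p/2})$, concluding the proof. □

⁴We can exchange the limit and expectation, as $g(\theta)\, \mathbb{1}\{\theta \in \Theta\} \le \max_{\theta \in \Theta} g(\theta)$ and $\int_{\Theta} \max_{\theta \in \Theta} g(\theta)\, d\theta = |\Theta| \max_{\theta \in \Theta} g(\theta) < \infty$.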


C.10 Markov Chains 


Definition C.6: Markov Chain


A Markov chain is a collection $\{X_t, t = 0, 1, 2, \ldots\}$ of random variables (or ran-
dom vectors) whose futures are conditionally independent of their pasts given their
present values. That is,

$$\mathbb{P}[X_{t+1} \in A \,|\, X_s, s \le t] = \mathbb{P}[X_{t+1} \in A \,|\, X_t] \quad \text{for all } t. \tag{C.41}$$


In other words, the conditional distribution of the future variable $X_{t+1}$, given the entire
past $\{X_s, s \le t\}$, is the same as the conditional distribution of $X_{t+1}$ given only the present $X_t$.
Property (C.41) is called the Markov property.





The index $t$ in $X_t$ is usually seen as a “time” or “step” parameter. The index set
$\{0, 1, 2, \ldots\}$ in the definition above was chosen out of convenience. It can be replaced by any
countable index set. We restrict ourselves to time-homogeneous Markov chains, that is, Markov
chains for which the conditional pdfs $f_{X_{t+1} | X_t}(y \,|\, x)$ do not depend on $t$; we abbreviate these
as $q(y \,|\, x)$. The $\{q(y \,|\, x)\}$ are called the (one-step) transition densities of the Markov chain.
Note that the random variables or vectors $\{X_t\}$ may be discrete (e.g., taking values in some
set $\{1, \ldots, r\}$) or continuous (e.g., taking values in an interval $[0, 1]$ or $\mathbb{R}^d$). In particular, in
the discrete case, each $q(y \,|\, x)$ is a probability: $q(y \,|\, x) = \mathbb{P}[X_{t+1} = y \,|\, X_t = x]$.

The distribution of $X_0$ is called the initial distribution of the Markov chain. The one-
step transition densities and the initial distribution completely specify the distribution of
the random vector $[X_0, X_1, \ldots, X_t]^\top$. Namely, we have by the product rule (C.17) and the
Markov property that the joint pdf is given by

$$\begin{aligned}
f_{X_0, \ldots, X_t}(x_0, \ldots, x_t) &= f_{X_0}(x_0)\, f_{X_1 | X_0}(x_1 \,|\, x_0) \cdots f_{X_t | X_{t-1}, \ldots, X_0}(x_t \,|\, x_{t-1}, \ldots, x_0) \\
&= f_{X_0}(x_0)\, f_{X_1 | X_0}(x_1 \,|\, x_0) \cdots f_{X_t | X_{t-1}}(x_t \,|\, x_{t-1}) \\
&= f_{X_0}(x_0)\, q(x_1 \,|\, x_0)\, q(x_2 \,|\, x_1) \cdots q(x_t \,|\, x_{t-1}).
\end{aligned}$$
A Markov chain is said to be ergodic if the probability distribution of $X_t$ converges to
a fixed distribution as $t \to \infty$. Ergodicity is a property of many Markov chains. Intuitively,
the probability of encountering the Markov chain in a state $x$ at a time $t$ far into the future
should not depend on $t$, provided that the Markov chain can reach every state from any
other state (such Markov chains are said to be irreducible) and does not “escape” to
infinity. Thus, for an ergodic Markov chain the pdf $f_{X_t}(x)$ converges to a fixed limiting pdf
$f(x)$ as $t \to \infty$, irrespective of the starting state. For the discrete case, $f(x)$ corresponds to
the long-run fraction of times that the Markov process visits $x$.

Under mild conditions (such as irreducibility) the limiting pdf $f(x)$ can be found by
solving the global balance equations:

$$f(x) = \begin{cases} \sum_{y} f(y)\, q(x \,|\, y) & \text{(discrete case)}, \\[4pt] \int f(y)\, q(x \,|\, y)\, dy & \text{(continuous case)}. \end{cases} \tag{C.42}$$

For the discrete case the rationale behind this is as follows. Since $f(x)$ is the long-run
proportion of time that the Markov chain spends in $x$, the proportion of transitions out of
$x$ is $f(x)$. This should be balanced with the proportion of transitions into state $x$, which is
$\sum_{y} f(y)\, q(x \,|\, y)$.
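For a small discrete chain, the global balance equations can be solved by repeatedly applying the map $f(x) \leftarrow \sum_y f(y)\, q(x \,|\, y)$ until it stops changing. The sketch below assumes a hypothetical three-state transition matrix `Q` chosen purely for illustration (it is not from the text), with `Q[x][y]` standing for $q(y \,|\, x)$. The chain is irreducible and has self-loops, so the iteration converges; for this particular matrix the limit is $(1/4, 1/2, 1/4)$.

```python
# Hypothetical 3-state chain; Q[x][y] = q(y | x), so each row sums to 1.
Q = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

def step(f, Q):
    """One application of the global balance map: f(x) <- sum_y f(y) q(x | y)."""
    n = len(Q)
    return [sum(f[y] * Q[y][x] for y in range(n)) for x in range(n)]

def limiting_pdf(Q, iters=200):
    """Iterate the balance map, starting from a point mass on state 0."""
    f = [1.0] + [0.0] * (len(Q) - 1)
    for _ in range(iters):
        f = step(f, Q)
    return f

f = limiting_pdf(Q)
print([round(v, 4) for v in f])  # a fixed point of step(., Q)
```

Starting from a different initial distribution should give the same limit, which is the "irrespective of the starting state" property described above.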

One is often interested in a stronger type of balance equations. Imagine that we have 
taken a video of the evolution of the Markov chain, which we may run in forward and 
reverse time. If we cannot determine whether the video is running forward or backward 
(we cannot determine any systematic “looping”, which would indicate in which direction 
time is flowing), the chain is said to be time-reversible or simply reversible. 

Although not every Markov chain is reversible, each ergodic Markov chain, when run
backwards, gives another Markov chain, the reverse Markov chain, with transition
densities $\widetilde{q}(y \,|\, x) = f(y)\, q(x \,|\, y) / f(x)$. To see this, first observe that $f(x)$ is the long-run
proportion of time spent in $x$ for both the original and reverse Markov chain. Secondly,
the “probability flux” from $x$ to $y$ in the reversed chain must be equal to the probability
flux from $y$ to $x$ in the original chain, meaning $f(x)\, \widetilde{q}(y \,|\, x) = f(y)\, q(x \,|\, y)$, which yields the
stated transition probabilities for the reversed chain. In particular, for a reversible Markov
chain we have

$$f(x)\, q(y \,|\, x) = f(y)\, q(x \,|\, y) \quad \text{for all } x, y. \tag{C.43}$$

These are the detailed (or local) balance equations. Note that the detailed balance equa-
tions imply the global balance equations. Hence, if a Markov chain is irreducible and there
exists a pdf $f$ such that (C.43) holds, then $f(x)$ must be the limiting pdf. In the discrete state
space case an additional condition is that the chain must be aperiodic, meaning that the
return times to the same state cannot always be a multiple of some integer $\ge 2$.


■ Example C.9 (Random Walk on a Graph) Consider a Markov chain that performs a
“random walk” on the graph in Figure C.7, at each step jumping from the current vertex
(node) to one of the adjacent vertices, with equal probability. Clearly this Markov chain is
reversible. It is also irreducible and aperiodic. Let $f(x)$ denote the limiting probability that
the chain is in vertex $x$. By symmetry, $f(1) = f(2) = f(7) = f(8)$, $f(4) = f(5)$ and $f(3) =
f(6)$. Moreover, by the detailed balance equations, $f(4)/5 = f(1)/3$, and $f(3)/4 = f(1)/3$.
It follows that $f(1) + \cdots + f(8) = 4 f(1) + 2 \times \tfrac{5}{3} f(1) + 2 \times \tfrac{4}{3} f(1) = 10 f(1) = 1$, so
that $f(1) = 1/10$, $f(3) = 2/15$, and $f(4) = 1/6$.




Figure C.7: The random walk on this graph is reversible. 
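The arithmetic in Example C.9 can be reproduced exactly. The sketch below assumes the vertex degrees implied by the ratios quoted in the example (degree 3 for vertices 1, 2, 7, 8; degree 4 for vertices 3 and 6; degree 5 for vertices 4 and 5); the edge list itself is not recoverable from the text, so only the degrees are used. Because $q(y \,|\, x) = 1/\deg(x)$ for adjacent vertices, detailed balance reduces to $f(x)/\deg(x)$ being constant, so $f(x)$ is proportional to $\deg(x)$.

```python
from fractions import Fraction

# Vertex degrees read off from the detailed balance ratios in Example C.9
# (an assumption: the figure with the actual edges is not available here).
degree = {1: 3, 2: 3, 3: 4, 4: 5, 5: 5, 6: 4, 7: 3, 8: 3}

total = sum(degree.values())
f = {v: Fraction(d, total) for v, d in degree.items()}

# Detailed balance with q(y | x) = 1/deg(x) holds when f(x)/deg(x) is the
# same for every vertex, so this set should contain a single value.
flux = {f[v] / degree[v] for v in degree}

print(f[1], f[3], f[4], sum(f.values()))
```

Using `Fraction` keeps the results exact, so they can be compared directly with the values stated in the example.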


C.11 Statistics 


Statistics deals with the gathering, summarization, analysis, and interpretation of data. The 
two main branches of statistics are: 


1. Classical or frequentist statistics: Here the observed data $\tau$ is viewed as the out-
come of random data $\mathcal{T}$ described by a probabilistic model; usually the model is
specified up to a (multidimensional) parameter, that is, $\mathcal{T} \sim g(\cdot \,|\, \theta)$ for some $\theta$. The
statistical inference is then purely concerned with the model and in particular with
the parameter $\theta$. For example, on the basis of the data one may wish to

(a) estimate the parameter,
(b) perform statistical tests on the parameter, or
(c) validate the model.

2. Bayesian statistics: In this approach we average over all possible values of the
parameter $\theta$ using a user-specified weight function $g(\theta)$ and obtain the model
$\mathcal{T} \sim \int g(\cdot \,|\, \theta)\, g(\theta)\, d\theta$. For practical computations, this means that we can treat $\theta$ as a
random variable with pdf $g(\theta)$. Bayes' formula $g(\theta \,|\, \tau) \propto g(\tau \,|\, \theta)\, g(\theta)$ is used to learn
$\theta$ based on the observed data $\tau$.


■ Example C.10 (Iid Sample) The most fundamental statistical model is where the data
$\mathcal{T} = \{X_1, \ldots, X_n\}$ is such that the random variables $X_1, \ldots, X_n$ are assumed to be independent
and identically distributed:

$$X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathsf{Dist},$$

according to some known or unknown distribution $\mathsf{Dist}$. An iid sample is often called a
random sample in the statistics literature. Note that the word “sample” can refer to both a
collection of random variables and to a single random variable. It should be clear from the
context which meaning is being used.

Often our guess or model for the true distribution is specified up to an unknown para-
meter $\theta$, with $\theta \in \Theta$. The most common model is:

$$X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathsf{N}(\mu, \sigma^2),$$

in which case $\theta = (\mu, \sigma^2)$ and $\Theta = \mathbb{R} \times \mathbb{R}_+$. ■


C.12 Estimation 


Suppose the model $g(\cdot \,|\, \theta)$ for the data $\mathcal{T}$ is completely specified up to an unknown para-
meter vector $\theta$. The aim is to estimate $\theta$ on the basis of the observed data $\tau$ only (an altern-
ative goal could be to estimate $\eta = \psi(\theta)$ for some vector-valued function $\psi$). Specifically,
the goal is to find an estimator $T = T(\mathcal{T})$ that is close to the unknown $\theta$. The correspond-
ing outcome $t = T(\tau)$ is the estimate of $\theta$. The bias of an estimator $T$ of $\theta$ is defined as
$\mathbb{E}_\theta T - \theta$. An estimator $T$ of $\theta$ is said to be unbiased if $\mathbb{E}_\theta T = \theta$. We often write $\widehat{\theta}$ for both
an estimator and estimate of $\theta$. The mean squared error (MSE) of a real-valued estimator
$T$ is defined as

$$\text{MSE} = \mathbb{E}_\theta (T - \theta)^2.$$

An estimator $T_1$ is said to be more efficient than an estimator $T_2$ if the MSE of $T_1$ is smaller
than the MSE of $T_2$. The MSE can be written as the sum

$$\text{MSE} = (\mathbb{E}_\theta T - \theta)^2 + \mathbb{V}\text{ar}_\theta\, T.$$

The first term is the squared bias and the second is the variance of the estimator.
In particular, for an unbiased estimator the MSE is simply equal to its
variance.
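The decomposition of the MSE into squared bias plus variance can be checked by simulation. The following sketch rests on illustrative assumptions that are not part of the text: the data are drawn from a normal model with variance 4, the estimator is the divisor-$n$ variance estimate (named `biased_var` here), and the sample size, number of repetitions, and seed are arbitrary. The empirical MSE and the sum of the empirical squared bias and variance agree up to rounding, because the identity also holds for sample moments.

```python
import random
import statistics

rng = random.Random(42)
sigma2 = 4.0            # true variance of the hypothetical N(0, 2^2) model
n, reps = 5, 20_000     # small samples make the bias clearly visible

def biased_var(x):
    """Variance estimate with divisor n (rather than n - 1)."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x) / len(x)

estimates = [biased_var([rng.gauss(0.0, 2.0) for _ in range(n)]) for _ in range(reps)]

mse = statistics.fmean((t - sigma2) ** 2 for t in estimates)
bias = statistics.fmean(estimates) - sigma2
var = statistics.pvariance(estimates)
print(mse, bias ** 2 + var, bias)  # first two agree; bias is negative
```

With these settings the estimated bias should be near $-\sigma^2/n = -0.8$, up to simulation error.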

For simulation purposes it is often important to include the running time of the estim-
ator in efficiency comparisons. One way to compare two unbiased estimators $T_1$ and $T_2$ is
to compare their relative time variance products,

$$\frac{\tau_i\, \mathbb{V}\text{ar}\, T_i}{(\mathbb{E} T_i)^2}, \quad i = 1, 2, \tag{C.44}$$

where $\tau_1$ and $\tau_2$ are the times required to calculate the estimators $T_1$ and $T_2$, respectively.
In this scheme, $T_1$ is considered more efficient than $T_2$ if its relative time variance product
is smaller. We discuss next two systematic approaches for constructing sound estimators.

C.12.1 Method of Moments 

Suppose $x_1, \ldots, x_n$ are outcomes from an iid sample $X_1, \ldots, X_n \sim_{\text{iid}} g(x \,|\, \theta)$, where $\theta =
[\theta_1, \ldots, \theta_k]^\top$ is unknown. The moments of the sampling distribution can be easily estim-
ated. Namely, if $X \sim g(x \,|\, \theta)$, then the $r$-th moment of $X$, that is $\mu_r(\theta) = \mathbb{E}_\theta X^r$ (assuming
it exists), can be estimated through the sample $r$-th moment: $\frac{1}{n} \sum_{i=1}^{n} x_i^r$. The method of mo-
ments involves choosing the estimate $\widehat{\theta}$ of $\theta$ such that each of the first $k$ sample and true
moments are matched:

$$\frac{1}{n} \sum_{i=1}^{n} x_i^r = \mu_r(\widehat{\theta}), \quad r = 1, 2, \ldots, k.$$

In general, this set of equations is nonlinear and so its solution often has to be found
numerically.


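■ Example C.11 (Sample Mean and Sample Variance) Suppose the data is given by
$\mathcal{T} = \{X_1, \ldots, X_n\}$, where the $\{X_i\}$ form an iid sample from a general distribution with mean
$\mu$ and variance $\sigma^2 < \infty$. Matching the first two moments gives the set of equations

$$\frac{1}{n} \sum_{i=1}^{n} x_i = \mu, \qquad \frac{1}{n} \sum_{i=1}^{n} x_i^2 = \mu^2 + \sigma^2.$$

The method of moments estimates for $\mu$ and $\sigma^2$ are therefore the sample mean

$$\widehat{\mu} = \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{C.45}$$

and

$$\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\overline{x})^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2. \tag{C.46}$$

The corresponding estimator for $\mu$, $\overline{X}$, is unbiased. However, the estimator for $\sigma^2$ is biased:
$\mathbb{E}\, \widehat{\sigma^2} = \sigma^2 (n - 1)/n$. An unbiased estimator is the sample variance

$$S^2 = \widehat{\sigma^2}\, \frac{n}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \overline{X})^2.$$

Its square root, $S = \sqrt{S^2}$, is called the sample standard deviation. ■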
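The estimates in Example C.11 are easy to compute directly. The sketch below uses a small made-up data set (an assumption for illustration only) and compares the hand-coded formulas (C.45) and (C.46) with the standard library: `statistics.pvariance` uses the divisor $n$, whereas `statistics.variance` and `statistics.stdev` use the divisor $n - 1$.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data for illustration
n = len(x)

xbar = sum(x) / n                                    # sample mean (C.45)
sigma2_hat = sum(xi ** 2 for xi in x) / n - xbar ** 2  # moment estimate (C.46)
s2 = sigma2_hat * n / (n - 1)                        # sample variance S^2
s = s2 ** 0.5                                        # sample standard deviation

print(xbar, sigma2_hat, s2, s)
print(statistics.pvariance(x), statistics.variance(x), statistics.stdev(x))
```

For larger data sets with a mean far from zero, the second form in (C.46), based on centered values, is numerically safer than subtracting two large numbers.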


■ Example C.12 (Sample Covariance Matrix) The method of moments can also be
used to estimate the covariance matrix of a random vector. In particular, let the $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$
be iid copies of a $d$-dimensional random vector $\boldsymbol{X}$ with mean vector $\boldsymbol{\mu}$ and covari-
ance matrix $\boldsymbol{\Sigma}$. We assume $n \ge d$. The moment estimator for $\boldsymbol{\mu}$ is, as in the $d = 1$ case,
$\overline{\boldsymbol{X}} = (\boldsymbol{X}_1 + \cdots + \boldsymbol{X}_n)/n$. As the covariance matrix can be written (see (C.15)) as

$$\boldsymbol{\Sigma} = \mathbb{E}\, (\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^\top,$$

the method of moments yields the estimator

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{X}_i - \overline{\boldsymbol{X}})(\boldsymbol{X}_i - \overline{\boldsymbol{X}})^\top. \tag{C.47}$$

Similar to the one-dimensional case ($d = 1$), replacing the factor $1/n$ with $1/(n - 1)$ gives
an unbiased estimator, called the sample covariance matrix. ■


C.12.2 Maximum Likelihood Method 


The concept of likelihood is central in statistics. It describes in a precise way the informa-
tion about model parameters that is contained in the observed data.

Let $\mathcal{T}$ be a (random) data object that is modeled as a draw from the pdf $g(\tau \,|\, \theta)$ (dis-
crete or continuous) with parameter vector $\theta \in \Theta$. Let $\tau$ be an outcome of $\mathcal{T}$. The function
$L(\theta \,|\, \tau) := g(\tau \,|\, \theta)$, $\theta \in \Theta$, is called the likelihood function of $\theta$, based on $\tau$. The (nat-
ural) logarithm of the likelihood function is called the log-likelihood function and is often
denoted by a lower case $l$.


Note that $L(\theta \,|\, \tau)$ and $g(\tau \,|\, \theta)$ have the same formula, but the first is viewed as a
function of $\theta$ for fixed $\tau$, whereas the second is viewed as a function of $\tau$ for fixed $\theta$.


The concept of likelihood is particularly useful when $\mathcal{T}$ is modeled as an iid sample
$\{X_1, \ldots, X_n\}$ from some pdf $g$. In that case, the likelihood of the data $\tau = \{x_1, \ldots, x_n\}$, as a
function of $\theta$, is given by the product

$$L(\theta \,|\, \tau) = \prod_{i=1}^{n} g(x_i \,|\, \theta). \tag{C.48}$$

Let $\tau$ be an observation from $\mathcal{T} \sim g(\tau \,|\, \theta)$, and suppose that $g(\tau \,|\, \theta)$ takes its largest
value at $\theta = \widehat{\theta}$. In a way this $\widehat{\theta}$ is our best estimate for $\theta$, as it maximizes the probability
(density) for the observation $\tau$. It is called the maximum likelihood estimate (MLE) of $\theta$.
Note that $\widehat{\theta} = \widehat{\theta}(\tau)$ is a function of $\tau$. The corresponding random variable, also denoted $\widehat{\theta}$, is
the maximum likelihood estimator (also abbreviated as MLE).

Maximization of $L(\theta \,|\, \tau)$ as a function of $\theta$ is equivalent (when searching for the max-
imizer) to maximizing the log-likelihood $l(\theta \,|\, \tau)$, as the natural logarithm is an increasing
function. This is often easier, especially when $\mathcal{T}$ is an iid sample from some sampling
distribution. For example, for $L$ of the form (C.48), we have

$$l(\theta \,|\, \tau) = \sum_{i=1}^{n} \ln g(x_i \,|\, \theta).$$

If $l(\theta \,|\, \tau)$ is a differentiable function with respect to $\theta$ and the maximum is attained in the
interior of $\Theta$, and there exists a unique maximum point, then we can find the MLE of $\theta$ by
solving the equations

$$\frac{\partial}{\partial \theta_i}\, l(\theta \,|\, \tau) = 0, \quad i = 1, \ldots, d.$$


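■ Example C.13 (Bernoulli Random Sample) Suppose we have data $\tau = \{x_1, \ldots, x_n\}$
and assume the model $X_1, \ldots, X_n \sim_{\text{iid}} \mathsf{Ber}(\theta)$. Then, the likelihood function is given by

$$L(\theta \,|\, \tau) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{s} (1 - \theta)^{n - s}, \quad 0 < \theta < 1, \tag{C.49}$$

where $s := x_1 + \cdots + x_n =: n \overline{x}$. The log-likelihood is $l(\theta) = s \ln \theta + (n - s) \ln(1 - \theta)$. Through
differentiation with respect to $\theta$, we find the derivative

$$\frac{s}{\theta} - \frac{n - s}{1 - \theta} = \frac{s}{\theta (1 - \theta)} - \frac{n}{1 - \theta} = \frac{n}{\theta (1 - \theta)}\, (\overline{x} - \theta). \tag{C.50}$$

Solving $l'(\theta) = 0$ gives the ML estimate $\widehat{\theta} = \overline{x}$ and ML estimator $\widehat{\theta} = \overline{X}$. ■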
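The result of Example C.13 can be confirmed numerically. The sketch below assumes a short made-up sequence of Bernoulli outcomes; the names `loglik` and `score` and the grid resolution are illustrative choices. It maximizes the log-likelihood over a grid of $\theta$ values and compares the maximizer with the sample mean; it also evaluates the derivative, in the factored form of (C.50), at the sample mean.

```python
import math

x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # made-up Bernoulli outcomes
n, s = len(x), sum(x)

def loglik(theta):
    """l(theta) = s ln(theta) + (n - s) ln(1 - theta), for 0 < theta < 1."""
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

def score(theta):
    """Derivative of the log-likelihood, in the factored form of (C.50)."""
    return n / (theta * (1 - theta)) * (s / n - theta)

grid = [k / 1000 for k in range(1, 1000)]  # interior points of (0, 1)
theta_grid = max(grid, key=loglik)         # grid maximizer of the log-likelihood

print(theta_grid, s / n, score(s / n))
```

A grid search is used only to keep the check transparent; since the log-likelihood is concave here, solving `score(theta) = 0` gives the same answer.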


C.13 Confidence Intervals 


An essential part in any estimation procedure is to provide an assessment of the accuracy
of the estimate. Indeed, without information on its accuracy the estimate itself would be
meaningless. Confidence intervals (also called interval estimates) provide a precise way of
describing the uncertainty in the estimate.

Let $X_1, \ldots, X_n$ be random variables with a joint distribution depending on a parameter
$\theta \in \Theta$. Let $T_1 < T_2$ be statistics; that is, $T_i = T_i(X_1, \ldots, X_n)$, $i = 1, 2$ are functions of the
data, but not of $\theta$.


1. The random interval $(T_1, T_2)$ is called a stochastic confidence interval for $\theta$ with
confidence $1 - \alpha$ if

$$\mathbb{P}_\theta[T_1 < \theta < T_2] \ge 1 - \alpha \quad \text{for all } \theta \in \Theta. \tag{C.51}$$


2. If $t_1$ and $t_2$ are the observed values of $T_1$ and $T_2$, then the interval $(t_1, t_2)$ is called the
(numerical) confidence interval for $\theta$ with confidence $1 - \alpha$ for every $\theta \in \Theta$.


3. If the right-hand side of (C.51) is merely a heuristic estimate or approximation of
the true probability, then the resulting interval is called an approximate confidence
interval.


4. The probability $\mathbb{P}_\theta[T_1 < \theta < T_2]$ is called the coverage probability. For a $1 - \alpha$
confidence interval, it must be at least $1 - \alpha$.


For multidimensional parameters $\theta \in \mathbb{R}^d$ the stochastic confidence interval is replaced
with a stochastic confidence region $\mathcal{C} \subset \mathbb{R}^d$ such that $\mathbb{P}_\theta[\theta \in \mathcal{C}] \ge 1 - \alpha$ for all $\theta$.


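■ Example C.14 (Approximate Confidence Interval for the Mean) Let $X_1, X_2, \ldots, X_n$
be an iid sample from a distribution with mean $\mu$ and variance $\sigma^2 < \infty$ (both assumed
to be unknown). By the central limit theorem and the law of large numbers,

$$T := \frac{\overline{X} - \mu}{S / \sqrt{n}} \overset{\text{approx.}}{\sim} \mathsf{N}(0, 1),$$

for large $n$, where $S$ is the sample standard deviation. Rearranging the approximate equality
$\mathbb{P}[|T| \le z_{1-\alpha/2}] \approx 1 - \alpha$, where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal
distribution, yields

$$\mathbb{P}\left[\overline{X} - z_{1-\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \overline{X} + z_{1-\alpha/2} \frac{S}{\sqrt{n}}\right] \approx 1 - \alpha,$$

so that

$$\left(\overline{X} - z_{1-\alpha/2} \frac{S}{\sqrt{n}},\; \overline{X} + z_{1-\alpha/2} \frac{S}{\sqrt{n}}\right), \quad \text{abbreviated as} \quad \overline{X} \pm z_{1-\alpha/2} \frac{S}{\sqrt{n}}, \tag{C.52}$$

is an approximate stochastic $(1 - \alpha)$ confidence interval for $\mu$. ■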


Since (C.52) is an asymptotic result only, care should be taken when applying it to 
cases where the sample size is small or moderate and the sampling distribution is heavily 
skewed. 
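The interval (C.52) takes only a few lines to compute. The sketch below is one possible implementation under stated assumptions: the function name `approx_ci`, the simulated exponential data, the sample size, and the seed are illustrative and not part of the text. The quantile $z_{1-\alpha/2}$ is obtained from `statistics.NormalDist().inv_cdf`, and `statistics.stdev` uses the divisor $n - 1$, matching the sample standard deviation $S$.

```python
import math
import random
import statistics

def approx_ci(x, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean, as in (C.52)."""
    n = len(x)
    xbar = statistics.mean(x)
    s = statistics.stdev(x)                              # sample standard deviation S
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)   # quantile z_{1 - alpha/2}
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(400)]  # simulated data with true mean 1
lo, hi = approx_ci(x)
print(lo, hi)
```

As cautioned above, for small samples from heavily skewed distributions the actual coverage of this interval can fall noticeably below the nominal level.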


C.14 Hypothesis Testing 


Suppose the model for the data $\mathcal{T}$ is described by a family of probability distributions that
depend on a parameter $\theta \in \Theta$. The aim of hypothesis testing is to decide, on the basis of
the observed data $\tau$, which of two competing hypotheses holds true; these being the null
hypothesis, $H_0 : \theta \in \Theta_0$, and the alternative hypothesis, $H_1 : \theta \in \Theta_1$.

In classical statistics the null hypothesis and alternative hypothesis do not play equival-
ent roles. $H_0$ contains the “status quo” statement and is only rejected if the observed data
are very unlikely to have happened under $H_0$.

The decision whether to accept or reject $H_0$ is dependent on the outcome of a test
statistic $\mathbf{T} = \mathbf{T}(\mathcal{T})$. For simplicity, we discuss only the one-dimensional case $\mathbf{T} \equiv T$. Two
(related) types of decision rules are generally used:


1. Decision rule 1: Reject $H_0$ if $T$ falls in the critical region.
Here the critical region is any appropriately chosen region in $\mathbb{R}$. In practice a critical
region is one of the following:
• left one-sided: $(-\infty, c]$,
• right one-sided: $[c, \infty)$,
• two-sided: $(-\infty, c_1] \cup [c_2, \infty)$.
For example, for a right one-sided test, $H_0$ is rejected if the outcome of the test
statistic is too large. The endpoints $c$, $c_1$, and $c_2$ of the critical regions are called
critical values.


2. Decision rule 2: Reject $H_0$ if the P-value is smaller than some significance level $\alpha$.
The P-value is the probability that, under $H_0$, the (random) test statistic takes a value
as extreme as or more extreme than the one observed. In particular, if $t$ is the observed
outcome of the test statistic $T$, then
• left one-sided test: $P := \mathbb{P}_{H_0}[T \le t]$,
• right one-sided test: $P := \mathbb{P}_{H_0}[T \ge t]$,
• two-sided test: $P := \min\{2\, \mathbb{P}_{H_0}[T \le t],\, 2\, \mathbb{P}_{H_0}[T \ge t]\}$.


The smaller the P-value, the greater the strength of the evidence against $H_0$ provided
by the data. As a rule of thumb:


P < 0.10   suggestive evidence,
P < 0.05   reasonable evidence,
P < 0.01   strong evidence.

Whether the first or the second decision rule is used, one can make two types of errors, 
as depicted in Table C.4. 


Table C.4: Type I and II errors in hypothesis testing.

                         True statement
Decision        | $H_0$ is true   | $H_1$ is true
----------------+-----------------+----------------
Accept $H_0$    | Correct         | Type II Error
Reject $H_0$    | Type I Error    | Correct





The choice of the test statistic and the corresponding critical region involves a multiob- 
jective optimization criterion, whereby both the probabilities of a type I and type II error 
should, ideally, be chosen as small as possible. Unfortunately, these probabilities compete 
with each other. For example, if the critical region is made larger (smaller), the probability 
of a type II error is reduced (increased), but at the same time the probability of a type I 
error is increased (reduced). 

Since the type I error is considered more serious, Neyman and Pearson [93] suggested
the following approach: choose the critical region such that the probability of a type II error
is as small as possible, while keeping the probability of a type I error below a predetermined
small significance level $\alpha$.


■ Remark C.3 (Equivalence of Decision Rules) Note that decision rule 1 and 2 are
equivalent in the following sense:


Reject $H_0$ if $T$ falls in the critical region, at significance level $\alpha$.
$\Leftrightarrow$
Reject $H_0$ if the P-value is $\le$ significance level $\alpha$.

In other words, the P-value of the test is the smallest level of significance that would lead
to the rejection of $H_0$. ■


In general, a statistical test involves the following steps:


1. Formulate an appropriate statistical model for the data.

2. Give the null ($H_0$) and alternative ($H_1$) hypotheses in terms of the parameters
of the model.

3. Determine the test statistic (a function of the data only).

4. Determine the (approximate) distribution of the test statistic under $H_0$.

5. Calculate the outcome of the test statistic.

6. Calculate the P-value or the critical region, given a preselected significance
level $\alpha$.

7. Accept or reject $H_0$.





The actual choice of an appropriate test statistic is akin to selecting a good estimator
for the unknown parameter $\theta$. The test statistic should summarize the information about $\theta$
and make it possible to distinguish between the alternative hypotheses.


■ Example C.15 (Hypothesis Testing) We are given outcomes $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$
of two simulation studies obtained via independent runs, with $m = 100$ and $n = 50$. The
sample means and standard deviations are $\overline{x} = 1.3$, $s_X = 0.1$ and $\overline{y} = 1.5$, $s_Y = 0.3$. Thus,
the $\{x_i\}$ are outcomes of iid random variables $\{X_i\}$, the $\{y_i\}$ are outcomes of iid random
variables $\{Y_i\}$, and the $\{X_i\}$ and $\{Y_i\}$ are independent. We wish to assess whether the expect-
ations $\mu_X = \mathbb{E}X_i$ and $\mu_Y = \mathbb{E}Y_i$ are the same or not. Going through the 7 steps above, we
have:


1. The model is already specified above.

2. $H_0 : \mu_X - \mu_Y = 0$ versus $H_1 : \mu_X - \mu_Y \ne 0$.

3. For similar reasons as in Example C.14, take

$$T = \frac{\overline{X} - \overline{Y}}{\sqrt{S_X^2/m + S_Y^2/n}}.$$

4. By the central limit theorem, the statistic $T$ has, under $H_0$, approximately a standard
normal distribution (assuming the variances are finite).

5. The outcome of $T$ is $t = (\overline{x} - \overline{y}) / \sqrt{s_X^2/m + s_Y^2/n} \approx -4.59$.

6. As this is a two-sided test, the P-value is $2\, \mathbb{P}_{H_0}[T \le -4.59] \approx 4 \cdot 10^{-6}$.

7. Because the P-value is extremely small, there is overwhelming evidence that the two
expectations are not the same.
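Steps 5 and 6 of Example C.15 can be reproduced with the standard library. The sketch below uses only the summary statistics quoted in the example; the variable names are illustrative, and the standard normal cdf comes from `statistics.NormalDist`. The two-sided P-value follows the definition $\min\{2\, \mathbb{P}_{H_0}[T \le t],\, 2\, \mathbb{P}_{H_0}[T \ge t]\}$ given earlier.

```python
import math
from statistics import NormalDist

m, n = 100, 50
xbar, s_x = 1.3, 0.1   # sample mean and standard deviation of the x's
ybar, s_y = 1.5, 0.3   # sample mean and standard deviation of the y's

# Step 5: outcome of the test statistic.
t = (xbar - ybar) / math.sqrt(s_x ** 2 / m + s_y ** 2 / n)

# Step 6: two-sided P-value under the standard normal approximation.
cdf = NormalDist().cdf(t)
p_value = 2 * min(cdf, 1 - cdf)

print(round(t, 2), p_value)
```

The computed statistic rounds to $-4.59$ and the P-value is of order $10^{-6}$, in line with the values reported in the example.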
Further Reading


Accessible treatises on probability and stochastic processes include [27, 26, 39, 54, 101]. 
Kallenberg’s book [61] provides a complete graduate-level overview of the foundations of 
modern probability. Details on the convergence of probability measures and limit theorems 
can be found in [11]. For an accessible introduction to mathematical statistics with simple 
applications see, for example, [69, 74, 124]. For a more detailed overview of statistical 
inference, see [10, 25]. A standard reference for classical (frequentist) statistical inference 
is [78]. 
APPENDIX D





PYTHON PRIMER 





Python has become the programming language of choice for many researchers and 
practitioners in data science and machine learning. This appendix gives a brief intro- 
duction to the language. As the language is under constant development and each year 
many new packages are being released, we do not pretend to be exhaustive in this in- 
troduction. Instead, we hope to provide enough information for novices to get started 
with this beautiful and carefully thought-out language. 


D.1 Getting Started 


The main website for Python is 
https://www.python.org/, 


where you will find documentation, a tutorial, beginners’ guides, software examples, and 
so on. It is important to note that there are two incompatible “branches” of Python, called 
Python 3 and Python 2. Further development of the language will involve only Python 3, 
and in this appendix (and indeed the rest of the book) we only consider Python 3. As there 
are many interdependent packages that are frequently used with a Python installation, it 
is convenient to install a distribution — for instance, the Anaconda Python distribution, 
available from 


https://www.anaconda.com/. 


The Anaconda installer automatically installs the most important packages and also 
provides a convenient interactive development environment (IDE), called Spyder. 


Use the Anaconda Navigator to launch Spyder, Jupyter notebook, install and update 


packages, or open a command-line terminal. 





To get started¹, try out the Python statements in the input boxes that follow. You can
either type these statements at the IPython command prompt or run them as (very short)
Python programs. The output for these two modes of input can differ slightly. For ex-
ample, typing a variable name in the console causes its contents to be automatically printed,
whereas in a Python program this must be done explicitly by calling the print function.
Selecting (highlighting) several program lines in Spyder and then pressing function key²
F9 is equivalent to executing these lines one by one in the console.

¹We assume that you have installed all the necessary files and have launched Spyder.

In Python, data is represented as an object or relation between objects (see also Sec- 
tion D.2). Basic data types are numeric types (including integers, booleans, and floats), 
sequence types (including strings, tuples, and lists), sets, and mappings (currently, diction- 
aries are the only built-in mapping type). 

Strings are sequences of characters, enclosed by single or double quotes. We can print 
strings via the print function. 


print("Hello World!") 





Hello World! 


For pretty-printing output, Python strings can be formatted using the format method. The
bracket syntax {i} provides a placeholder for the i-th variable to be printed, with 0 being
the first index. Individual variables can be formatted separately and as desired; formatting
syntax is discussed in more detail in Section D.9.


print("Name:{1} (height {2} m, age {0})".format(111,"Bilbo",0.84))


Name:Bilbo (height 0.84 m, age 111)


Lists can contain different types of objects, and are created using square brackets as in the 
following example: 





[1,'string',"another string"] # Quote type is not important


[1, 'string', 'another string']


Elements in lists are indexed starting from 0, and are mutable (can be changed): 





x = [1,2]
x[0] = 2 # Note that the first index is 0 





In contrast, tuples (with round brackets) are immutable (cannot be changed). Strings are 
immutable as well. 


x = (1,2)
x[0] = 2 


TypeError: 'tuple' object does not support item assignment 





Lists can be accessed via the slice notation [start:end]. It is important to note that end 
is the index of the first element that will not be selected, and that the first element has index 
0. To gain familiarity with the slice notation, execute each of the following lines. 





a = [2, 3, 5, 7, 11, 13, 17, 19, 23]
a[1:4]   # Elements with index from 1 to 3
a[:4]    # All elements with index less than 4
a[3:]    # All elements with index 3 or more
a[-2:]   # The last two elements


[3, 5, 7]
[2, 3, 5, 7]
[7, 11, 13, 17, 19, 23]
[19, 23]


² This may depend on the keyboard and operating system.


An operator is a programming language construct that performs an action on one or more
operands. The action of an operator in Python depends on the type of the operand(s). For
example, operators such as +, *, -, and % that are arithmetic operators when the operands
are of a numeric type, can have different meanings for objects of non-numeric type (such
as strings).


'hello' + 'world' # String concatenation


'helloworld'


'hello' * 2 # String repetition


'hellohello'


[1,2] * 2 # List repetition


[1, 2, 1, 2]


15 % 4 # Remainder of 15/4


3

Some common Python operators are given in Table D.1. 





D.2 Python Objects 


As mentioned in the previous section, data in Python is represented by objects or relations 
between objects. We recall that basic data types included strings and numeric types (such 
as integers, booleans, and floats). 

As Python is an object-oriented programming language, functions are objects too 
(everything is an object!). Each object has an identity (unique to each object and immutable 
— that is, cannot be changed — once created), a type (which determines which operations 
can be applied to the object, and is considered immutable), and a value (which is either 
mutable or immutable). The unique identity assigned to an object obj can be found by 
calling id, as in id(obj). 

Each object has a list of attributes, and each attribute is a reference to another object. 
The function dir applied to an object returns the list of attributes. For example, a string 
object has many useful attributes, as we shall shortly see. Functions are objects with the 
__call__ attribute.
A class (see Section D.8) can be thought of as a template for creating a custom type of
object.


s = "hello"
d = dir(s)
print(d,flush=True) # Print the list in "flushed" format


['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
... (many left out) ... 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']





Any attribute attr of an object obj can be accessed via the dot notation: obj.attr. To
find more information about any object use the help function.


s = "hello"
help(s.replace) 


replace(...) method of builtins.str instance 
S.replace(old, new[, count]) -> str 


Return a copy of S with all occurrences of substring 
old replaced by new. If the optional argument count is 
given, only the first count occurrences are replaced. 





This shows that the attribute replace is in fact a function. An attribute that is a function is
called a method. We can use the replace method to create a new string from the old one
by changing certain characters.


s = 'hello'
s1 = s.replace('e','a')
print(s1)


hallo


In many Python editors, pressing the TAB key, as in objectname.<TAB>, will bring
up a list of possible attributes via the editor’s autocompletion feature.





D.3 Types and Operators 


Each object has a type. Three basic data types in Python are str (for string), int (for
integers), and float (for floating point numbers). The function type returns the type of
an object.


t1 = type([1,2,3])
t2 = type((1,2,3))
t3 = type({1,2,3})
print(t1,t2,t3)










<class 'list'> <class 'tuple'> <class 'set'> 


The assignment operator, =, assigns an object to a variable; e.g., x = 12. An expression 
is a combination of values, operators, and variables that yields another value or variable. 


Variable names are case sensitive and can only contain letters, numbers, and underscores.
They must start with either a letter or underscore. Note that reserved words
such as True and False are case sensitive as well.


Python is a dynamically typed language, and the type of a variable at a particular point 
during program execution is determined by its most recent object assignment. That is, the 
type of a variable does not need to be explicitly declared from the outset (as is the case in 
C or Java), but instead the type of the variable is determined by the object that is currently 
assigned to it. 

It is important to understand that a variable in Python is a reference to an object — 
think of it as a label on a shoe box. Even though the label is a simple entity, the contents 
of the shoe box (the object to which the variable refers) can be arbitrarily complex. Instead 
of moving the contents of one shoe box to another, it is much simpler to merely move the 
label. 


x = [1,2]
y = x                   # y refers to the same object as x
print(id(x) == id(y))   # check that the object id's are the same
y[0] = 100              # change the contents of the list that y refers to
print(x)


True
[100, 2]


x = [1,2]
y = x                   # y refers to the same object as x
y = [100,2]             # now y refers to a different object
print(id(x) == id(y))
print(x)


False
[1, 2]
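
The label analogy can be made concrete with a small sketch that contrasts a second label on the same object with a copy of the object. The names alias and clone are chosen here only for illustration; list.copy makes a new (shallow) list, so a change made through one label no longer shows up through the other.

```python
x = [1, 2]
alias = x          # a second label on the same shoe box
clone = x.copy()   # a new shoe box with (shallow-)copied contents

alias[0] = 100     # modifies the single list that x and alias both refer to
clone[1] = -5      # modifies only the copy

print(x is alias)  # True: same object
print(x is clone)  # False: different objects
print(x, clone)    # [100, 2] [1, -5]
```

The is operator compares identities, so x is alias gives the same answer as id(x) == id(alias).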


Table D.1 shows a selection of Python operators for numerical and logical variables. 


Table D.1: Common numerical (left) and logical (right) operators. 


+    addition             ~    binary NOT
-    subtraction          &    binary AND
*    multiplication       ^    binary XOR
**   power                |    binary OR
/    division             ==   equal to
//   integer division     !=   not equal to
%    modulus










Several of the numerical operators can be combined with an assignment operator, as in
x += 1 to mean x = x + 1. Operators such as + and * can be defined for other data types
as well, where they take on a different meaning. This is called operator overloading, an
example of which is the use of <List> * <Integer> for list repetition as we saw earlier.
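
As a rough sketch of how operator overloading can be provided for a user-defined type, the following code defines + and * for a small 2D vector class. The class Vec2 is hypothetical and serves only as an illustration (classes are discussed in Section D.8); the special methods __add__ and __mul__ are the hooks that Python calls for the + and * operators.

```python
class Vec2:
    """A hypothetical 2D vector class, used only to illustrate overloading."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):   # called for v + w
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, c):       # called for v * c, with c a number
        return Vec2(c * self.x, c * self.y)

    def __repr__(self):         # text shown when the object is printed
        return "Vec2({}, {})".format(self.x, self.y)

v = Vec2(1, 2)
w = Vec2(3, 4)
print(v + w)   # Vec2(4, 6)
print(v * 2)   # Vec2(2, 4)
```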


D.4 Functions and Methods 


Functions make it easier to divide a complex program into simpler parts. To create a 
function, use the following syntax: 


def <function name>(<parameter_list>):
    <statements>


A function takes a list of input variables that are references to objects. Inside the func- 
tion, a number of statements are executed which may modify the objects, but not the ref- 
erence itself. In addition, the function may return an output object (or will return the value 
None if not explicitly instructed to return output). Think again of the shoe box analogy. The 
input variables of a function are labels of shoe boxes, and the objects to which they refer 
are the contents of the shoe boxes. The following program highlights some of the subtleties 
of variables and objects in Python. 


Note that the statements within a function must be indented. This is Python’s way to 
define where a function begins and ends. 


x = [1,2,3]

def change_list(y):
    y.append(100)   # Append an element to the list referenced by y
    y[0] = 90       # Modify the first element of the same list
    y = [2,3,4]     # The local y now refers to a different list
                    # The list to which y first referred does not change
    return sum(y)

print(change_list(x))
print(x)


9
[90, 2, 3, 100]





Variables that are defined inside a function only have local scope; that is, they are 
recognized only within that function. This allows the same variable name to be used in 
different functions without creating a conflict. If any variable is used within a function, 
Python first checks if the variable has local scope. If this is not the case (the variable has 
not been defined inside the function), then Python searches for that variable outside the 
function (the global scope). The following program illustrates several important points.


from numpy import array, square, sqrt

x = array([1.2,2.3,4.5])

def stat(x):
    n = len(x) #the length of x
    meanx = sum(x)/n
    stdx = sqrt(sum(square(x - meanx))/n)
    return [meanx, stdx]

print(stat(x))


[2.6666666666666665, 1.3719410418171119] 


1. Basic math functions such as sqrt are unknown to the standard Python interpreter 
and need to be imported. More on this in Section D.5 below. 





2. As was already mentioned, indentation is crucial. It shows where the function begins 
and ends. 


3. No semicolons³ are needed to end lines, but the first line of the function definition
(here line 5) must end with a colon (:).


4. Lists are not arrays (vectors of numbers), and vector operations cannot be performed 
on lists. However, the numpy module is designed specifically with efficient vec- 
tor/matrix operations in mind. On the second code line, we define x as a vector 
(ndarray) object. Functions such as square, sum, and sqrt are then applied to 
such arrays. Note that we used the default Python functions len and sum. More on 
numpy in Section D.10. 


5. Running the program with stat(x) instead of print(stat(x)) in line 11 will not
show any output in the console.
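
Returning to the scope rules described earlier in this section, the following minimal sketch shows the lookup order at work. The names scale, offset, rescale, and shadow are made up for this illustration.

```python
scale = 10                      # defined outside any function: global scope

def rescale(v):
    offset = 1                  # defined inside the function: local scope
    return scale * v + offset   # scale is not local, so the global one is used

def shadow(v):
    scale = 2                   # a new local variable; the global scale is untouched
    return scale * v

print(rescale(3))   # 31
print(shadow(3))    # 6
print(scale)        # 10
```

The local variable offset does not exist outside rescale, which is why the same name could be reused in another function without conflict.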


To display the complete list of built-in functions, type (using double underscores) 


dir(__builtins__).





D.5 Modules 


A Python module is a programming construct that is useful for organizing code into 
manageable parts. To each module with name module_name is associated a Python file 
module_name.py containing any number of definitions, e.g., of functions, classes, and 
variables, as well as executable statements. Modules can be imported into other programs 
using the syntax: import <module_name> as <alias_name>, where <alias_name> 
is a shorthand name for the module. 





³ Semicolons can be used to put multiple commands on a single line.




When imported into another Python file, the module name is treated as a namespace, 
providing a naming system where each object has its unique name. For example, different 
modules mod1 and mod2 can have different sum functions, but they can be distinguished by 
prefixing the function name with the module name via the dot notation, as in mod1.sum and
mod2.sum. For example, the following code uses the sqrt function of the numpy module.


import numpy as np 
np.sqrt(2) 


1.4142135623730951 
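
To illustrate how namespaces keep equally named functions apart, here is a small sketch using two modules from the Python standard library, math and cmath, both of which define a function called sqrt. The choice of these two modules is ours, made only for the example.

```python
import math
import cmath   # complex-number counterpart of math

print(math.sqrt(4))     # 2.0
print(cmath.sqrt(-4))   # 2j
```

Because each call is prefixed by its module name, there is no ambiguity about which sqrt is meant.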





A Python package is simply a directory of Python modules; that is, a collection of 
modules with additional startup information (some of which may be found in its __path__ 
attribute). Python’s built-in module is called __builtins__. Of the great many useful 
Python modules, Table D.2 gives a few. 


Table D.2: A few useful Python modules/packages. 


datetime      Module for manipulating dates and times.
matplotlib    MATLAB-type plotting package.
numpy         Fundamental package for scientific computing, including random
              number generation and linear algebra tools. Defines the ubiquitous
              ndarray class.
os            Python interface to the operating system.
pandas        Fundamental module for data analysis. Defines the powerful
              DataFrame class.
pytorch       Machine learning library that supports GPU computation.
requests      Library for performing HTTP requests and interfacing with the web.
scipy         Ecosystem for mathematics, science, and engineering, containing
              many tools for numerical computing, including those for integration,
              solving differential equations, and optimization.
seaborn       Package for statistical data visualization.
sklearn       Easy to use machine learning library.
statsmodels   Package for the analysis of statistical models.


The numpy package contains various subpackages, such as random, linalg, and fft. 
More details are given in Section D.10. 


When using Spyder, press Ctrl+I in front of any object, to display its help file in a
separate window.





As we have already seen, it is also possible to import only specific functions from a 
module using the syntax: from <module_name> import <fnc1, fnc2, ...>.


from numpy import sqrt, cos
sqrt(2)
cos(1)


1.4142135623730951
0.54030230586813965


This avoids the tedious prefixing of functions with the module name (or its alias). However,
for large programs it is good practice to always use the prefix/alias name construction, to
be able to clearly ascertain precisely which module a function being used belongs to.


D.6 Flow Control 


Flow control in Python is similar to that of many programming languages, with conditional 
statements as well as while and for loops. The syntax for if-then-else flow control is 
as follows. 


if <condition1>:
    <statements>
elif <condition2>:
    <statements>
else:
    <statements>


Here, <condition1> and <condition2> are logical conditions that are either True or
False; logical conditions often involve comparison operators (such as ==, >, <=, !=). 
In the example above, there is one elif part, which allows for an “else if” conditional 
statement. In general, there can be more than one elif part, or it can be omitted. The else 
part can also be omitted. The colons are essential, as are the indentations. 

The while and for loops have the following syntax. 


while <condition>:
    <statements>


for <variable> in <collection>:
    <statements>


Above, <collection> is an iterable object (see Section D.7 below). For further con- 
trol in for and while loops, one can use a break statement to exit the current loop, and 
the continue statement to continue with the next iteration of the loop, while abandoning 
any remaining statements in the current iteration. Here is an example. 


import numpy as np
ans = 'y'
while ans != 'n':
    outcome = np.random.randint(1,6+1)
    if outcome == 6:
        print("Hooray a 6!")
        break
    else:
        print("Bad luck, a", outcome)
    ans = input("Again? (y/n) ")
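
The dice program uses break but not continue. As a complementary sketch (the numbers are arbitrary and chosen only for illustration), the loop below uses both statements: continue skips the rest of the current iteration, whereas break leaves the loop altogether.

```python
total = 0
for k in range(1, 11):
    if k % 2 == 0:
        continue      # skip even numbers: go straight to the next k
    if k > 7:
        break         # leave the loop as soon as an odd k exceeds 7
    total += k        # reached only for k = 1, 3, 5, 7
print(total)          # 16
```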


D.7 Iteration 


Iterating over a sequence of objects, such as used in a for loop, is a common operation. 
To better understand how iteration works, we consider the following code. 


s = "Hello"
for c in s:
    print(c, "=", end=" ")





A string is an example of a Python object that can be iterated. One of the methods of a 
string object is __iter__. Any object that has such a method is called an iterable. Calling 
this method creates an iterator — an object that returns the next element in the sequence 
to be iterated. This is done via the method __next__. 


s = "Hello"
t = s.__iter__()      # t is now an iterator. Same as iter(s)
print(t.__next__() )  # same as next(t)
print(t.__next__() )
print(t.__next__() )


H
e
l





The inbuilt functions next and iter simply call these corresponding double- 
underscore functions of an object. When executing a for loop, the sequence/collection 
over which to iterate must be an iterable. During the execution of the for loop, an iterator 
is created and the next function is executed until there is no next element. An iterator is 
also an iterable, so can be used in a for loop as well. Lists, tuples, and strings are so-called 
sequence objects and are iterables, where the elements are iterated by their index. 
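
What a for loop does behind the scenes can be sketched by hand with iter and next. When there is no next element, next raises the built-in StopIteration exception, which signals the end of the iteration. The variable names below are ours.

```python
s = "Hello"
collected = []

t = iter(s)                  # create an iterator from the iterable
while True:
    try:
        c = next(t)          # ask for the next element
    except StopIteration:    # raised when there is no next element
        break
    collected.append(c)

print(collected)             # ['H', 'e', 'l', 'l', 'o']
```

The loop above visits the same elements, in the same order, as for c in s.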


The most common iterator in Python is the range iterator, which allows iteration over 
a range of indices. Note that range returns a range object, not a list. 


for i in range(4,20):
    print(i, end=' ')
print(range(4,20))


4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 range(4, 20)


Similar to Python’s slice operator [i:j], the iterator range(i, j) ranges from i to j,
not including the index j.





Two other common iterables are sets and dictionaries. Python sets are, as in mathem- 
atics, unordered collections of unique objects. Sets are defined with curly brackets {}, as 
opposed to round brackets ( ) for tuples, and square brackets [ ] for lists. Unlike lists, sets do 
not have duplicate elements. Many of the usual set operations are implemented in Python, 
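including the union A | B and intersection A & B.


A = {3, 2, 2, 4}   # example values (assumed); the duplicate 2 is stored only once
B = {4, 3, 1}      # example values (assumed)
C = A & B          # intersection of A and B
for i in A:
    print(i)
print(C)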


A useful way to construct lists is by list comprehension; that is, by expressions of the 
form 


<expression> for <element> in <list> if <condition> 


For sets a similar construction holds. In this way, lists and sets can be defined using very
similar syntax as in mathematics. Compare, for example, the mathematical definition of
the sets $A := \{3, 2, 4, 2\} = \{2, 3, 4\}$ (no order and no duplication of elements) and
$B := \{x^2 : x \in A\}$ with the Python code below.


setA = {3, 2, 4, 2}
setB = {x**2 for x in setA}
print(setB)
listA = [3, 2, 4, 2]
listB = [x**2 for x in listA]
print(listB)


A dictionary is a set-like data structure, containing one or more key: value pairs en- 
closed in curly brackets. The keys are often of the same type, but do not have to be; the 
same holds for the values. Here is a simple example, storing the ages of Lord of the Rings 
characters in a dictionary. 


DICT = {'Gimli': 140, 'Frodo':51, 'Aragorn': 88}
for key in DICT:
    print(key, DICT[key])


Gimli 140
Frodo 51
Aragorn 88
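
A few further dictionary operations are sketched below: adding a pair, testing for a key, looking up a key with a default value, iterating over key:value pairs, and building a dictionary by comprehension. The dictionary name ages and the added entry are illustrative only.

```python
ages = {'Frodo': 51, 'Aragorn': 88}      # illustrative data

ages['Sam'] = 38                         # add a new key:value pair
print('Frodo' in ages)                   # True: membership is tested on the keys
print(ages.get('Gandalf', 'unknown'))    # unknown: default for a missing key

for name, age in ages.items():           # iterate over (key, value) pairs
    print(name, age)

doubled = {name: 2 * age for name, age in ages.items()}   # dict comprehension
print(doubled['Sam'])                    # 76
```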


D.8 Classes 


Recall that objects are of fundamental importance in Python — indeed, data types and 
functions are all objects. A class is an object type, and writing a class definition can be 
thought of as creating a template for a new type of object. Each class contains a number 
of attributes, including a number of inbuilt methods. The basic syntax for the creation of a 
class is: 











class <class_name>:
    def __init__(self):
        <statements>
    <statements>


The main inbuilt method is __init__, which creates an instance of a class object.
For example, str is a class object (string class), but s = str('Hello') or simply 
s = 'Hello', creates an instance, s, of the str class. Instance attributes are created dur- 
ing initialization and their values may be different for different instances. In contrast, the 
values of class attributes are the same for every instance. The variable self in the initializ- 
ation method refers to the current instance that is being created. Here is a simple example, 
explaining how attributes are assigned. 


class shire_person:
    def __init__(self,name):   # initialization method
        self.name = name       # instance attribute
        self.age = 0           # instance attribute
    address = 'The Shire'      # class attribute

print(dir(shire_person)[1:5],'...',dir(shire_person)[-2:])
                               # list of class attributes

p1 = shire_person('Sam')       # create an instance
p2 = shire_person('Frodo')     # create another instance
print(p1.__dict__)             # list of instance attributes

p2.race = 'Hobbit'             # add another attribute to instance p2
p2.age = 33                    # change instance attribute
print(p2.__dict__)

print(getattr(p1,'address'))   # content of p1's class attribute


['__delattr__', '__dict__', '__dir__', '__doc__'] ...
['__weakref__', 'address']
{'name': 'Sam', 'age': 0}
{'name': 'Frodo', 'age': 33, 'race': 'Hobbit'}
The Shire





It is good practice to create all the attributes of the class object in the __init__ method, 
but, as seen in the example above, attributes can be created and assigned everywhere, even 
outside the class definition. More generally, attributes can be added to any object that has
a __dict__.
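
The difference between class attributes and instance attributes can be seen in the following sketch. The class Hobbit and all values are hypothetical, chosen only for illustration. Assigning to an attribute through an instance creates an instance attribute that hides the class attribute for that instance only.

```python
class Hobbit:                       # hypothetical class, for illustration only
    address = 'The Shire'           # class attribute, shared by all instances
    def __init__(self, name):
        self.name = name            # instance attribute

h1 = Hobbit('Merry')
h2 = Hobbit('Pippin')

Hobbit.address = 'Buckland'         # changing the class attribute ...
print(h1.address, h2.address)       # Buckland Buckland (... affects every instance)

h1.address = 'Bree'                 # creates an instance attribute that shadows it
print(h1.address, h2.address)       # Bree Buckland
print('address' in h1.__dict__, 'address' in h2.__dict__)   # True False
```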


An “empty” class can be created via

class <class_name>:
    pass





Python classes can be derived from a parent class by inheritance, via the following
syntax.


class <class_name>(<parent_class_name>):
    <statements>


The derived class (initially) inherits all of the attributes of the parent class.

As an example, the class shire_person below inherits the attributes name, age, and 
address from its parent class person. This is done using the super function, used here 
to refer to the parent class person without naming it explicitly. When creating a new 
object of type shire_person, the __init__ method of the parent class is invoked, and 
an additional instance attribute Shire_address is created. The dir function confirms that 
Shire_address is an attribute only of shire_person instances. 


class person:
    def __init__(self,name):
        self.name = name
        self.age = 0
        self.address = ''      # placeholder value (assumed)

class shire_person(person):
    def __init__(self,name):
        super().__init__(name)
        self.Shire_address = 'Bag End'

p1 = shire_person("Frodo")
p2 = person("Gandalf")
print(dir(p1)[:1],dir(p1)[-3:] )
print(dir(p2)[:1],dir(p2)[-3:] )


['Shire_address'] ['address', 'age', 'name']
['__class__'] ['address', 'age', 'name']


D.9 Files 


To write to or read from a file, a file first needs to be opened. The open function in Python 
creates a file object that is iterable, and thus can be processed in a sequential manner in a 
for or while loop. Here is a simple example. 


fout = open('output.txt','w')
for i in range(0,41):
    if i%10 == 0:
        fout.write('{:3d}\n'.format(i))
fout.close()





The first argument of open is the name of the file. The second argument specifies 
if the file is opened for reading ('r'), writing ('w'), appending ('a'), and so on. See 
help(open). Files are written in text mode by default, but it is also possible to write in 
binary mode. The above program creates a file output.txt with 5 lines, containing the 
strings 0, 10, ..., 40. Note that if we had written fout.write(i) in the fourth line of the
code above, an error message would be produced, as the variable i is an integer, and not a
string. Recall that the expression string.format() is Python’s way to specify the format
of the output string.

The formatting syntax {:3d} indicates that the output should have a minimum width of
three characters and be formatted as a decimal integer. As mentioned in the
introduction, bracket syntax {i} provides a placeholder for the i-th variable to be printed,
with 0 being the first index. The format for the output is further specified by {i:format},
where format is typically⁴ of the form:


[width][.precision][type]

In this specification:


• width specifies the minimum width of output;


• precision specifies the number of digits to be displayed after the decimal point for
floating point values of type f, or the total number of significant digits (before and
after the decimal point) for floating point values of type g;


• type specifies the type of output. The most common types are s for strings, d for
integers, b for binary numbers, f for floating point numbers (floats) in fixed-point
notation, g for floats in general notation, e for floats in scientific notation.
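
The same [width][.precision][type] specification is also accepted by formatted string literals (f-strings), available since Python 3.6, where the expression to be printed is written directly inside the brackets. The following is a small sketch; the variable names are ours.

```python
import math

e = math.e
width = 10

s1 = f"{e:.3f}"          # '2.718'
s2 = f"{e:.3g}"          # '2.72'
s3 = f"{e:.3e}"          # '2.718e+00'
s4 = f"{123:5d}"         # '  123'
s5 = f"{e:{width}.2f}"   # '      2.72' (a nested field supplies the width)
print(s1, s2, s3, s4, s5, sep="\n")
```

Each of these strings is identical to the one obtained from the corresponding call of the format method, for example '{:.3f}'.format(e).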


The following illustrates some behavior of formatting on numbers.


'{:5d}'.format(123)
'{:.4e}'.format(1234567890)
'{:.2f}'.format(1234567890)
'{:.2f}'.format(2.718281828)
'{:.3f}'.format(2.718281828)
'{:.3g}'.format(2.718281828)
'{:.3e}'.format(2.718281828)
'{0:3.3f}; {1:.4e};'.format(123.456789, 0.00123456789)


'  123'
'1.2346e+09'
'1234567890.00'
'2.72'
'2.718'
'2.72'
'2.718e+00'
'123.457; 1.2346e-03;'





The following code reads the text file output.txt line by line, and prints the output 
on the screen. To remove the newline \n character, we have used the strip method for 
strings, which removes any whitespace from the start and end of a string. 


fin = open('output.txt','r')
for line in fin:
    line = line.strip() # strips a newline character
    print(line)
fin.close()








⁴ More formatting options are possible.

When dealing with file input and output it is important to always close files. Files that
remain open, e.g., when a program finishes unexpectedly due to a programming error, can 
cause considerable system problems. For this reason it is recommended to open files via 
context management. The syntax is as follows. 


with open('output.txt', 'w') as f:
    f.write('Hi there!')
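
That a with block closes the file even when an error occurs inside it can be checked with a short experiment. The sketch below deliberately raises an error inside the block; it writes to a temporary directory (our choice, made so that no existing file is overwritten).

```python
import os
import tempfile

# Use a temporary directory so that no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

try:
    with open(path, 'w') as f:
        f.write('Hi there!')
        raise RuntimeError('simulated programming error')
except RuntimeError:
    pass

print(f.closed)        # True: the file was closed despite the error

with open(path, 'r') as f:
    content = f.read()
print(content)         # Hi there!
```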





Context management ensures that a file is correctly closed even when the program is 
terminated prematurely. An example is given in the next program, which outputs the most
frequent words in Dickens’s A Tale of Two Cities, which can be downloaded from the book’s
GitHub site as ataleof2cities.txt. 

Note that in the next program, the file ataleof2cities.txt must be placed in the cur- 
rent working directory. The current working directory can be determined via import os 
followed by cwd = os.getcwd(). 


numline = 0
DICT = {}
with open('ataleof2cities.txt', encoding="utf8") as fin:
    for line in fin:
        words = line.split()
        for w in words:
            if w not in DICT:
                DICT[w] = 1
            else:
                DICT[w] +=1
        numline += 1

sd = sorted(DICT,key=DICT.get,reverse=True) #sort the dictionary

print("Number of unique words: {}\n".format(len(DICT)))
print("Ten most frequent words:\n")
print("{:8} {}".format("word", "count"))
print(15*'-')
for i in range(0,10):
    print("{:8} {}".format(sd[i], DICT[sd[i]]))





Number of unique words: 19091

Ten most frequent words:

word     count
---------------
the      7348
and      4679
of       3949
to       3387
a        2768
in       2390
his      1911
was      1672
that     1650
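
As an aside, the counting loop in the program above can be written more compactly with the Counter class from the standard collections module. The sketch below is an alternative formulation, not the program used to produce the output above; to keep it self-contained it counts the words of a short in-line text rather than those of a file.

```python
from collections import Counter

# A short stand-in for the contents of a text file.
text = """it was the best of times
it was the worst of times"""

counts = Counter()
for line in text.splitlines():
    counts.update(line.split())   # add the words of this line to the tally

print(len(counts))                # 7 unique words
for word, n in counts.most_common(3):
    print("{:8} {}".format(word, n))
```

The method most_common(k) returns the k most frequent words with their counts, which replaces the explicit sorting step.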
D.10 NumPy


The package NumPy (module name numpy) provides the building blocks for scientific 
computing in Python. It contains all the standard mathematical functions, such as sin, 
cos, tan, etc., as well as efficient functions for random number generation, linear algebra, 
and statistical computation. 


import numpy as np #import the package
x = np.cos(1)
data = [1,2,3,4,5]
y = np.mean(data)
z = np.std(data)
print('cos(1) = {0:1.8f} mean = {1} std = {2}'.format(x,y,z))


cos(1) = 0.54030231 mean = 3.0 std = 1.4142135623730951 





D.10.1 Creating and Shaping Arrays 


The fundamental data type in numpy is the ndarray. This data type allows for fast matrix
operations via highly optimized numerical libraries such as LAPACK and BLAS; this is in
contrast to (nested) lists. As such, numpy is often essential when dealing with large amounts
of quantitative data.

ndarray objects can be created in various ways. The following code creates a 2 × 3 × 2
array of zeros. Think of it as a 3-dimensional matrix or two stacked 3 × 2 matrices.


A = np.zeros([2,3,2])   # 2 by 3 by 2 array of zeros
print(A)
print(A.shape)          # size of the array along each dimension
print(type(A))          # A is an ndarray


[[[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]
(2, 3, 2)
<class 'numpy.ndarray'>





We will be mostly working with 2D arrays; that is, ndarrays that represent ordinary 
matrices. We can also use the range method and lists to create ndarrays via the array 
method. Note that arange is numpy’s version of range, with the difference that arange 
returns an ndarray object. 


a = np.array(range(4))   # equivalent to np.arange(4)
b = np.array([0,1,2,3])
C = np.array([[1,2,3],[3,2,1]])
print(a, '\n', b,'\n' , C)


[0 1 2 3]
 [0 1 2 3]
 [[1 2 3]
 [3 2 1]]


The dimension of an ndarray can be obtained via its shape attribute, which returns a
tuple. Arrays can be reshaped via the reshape method. This does not change the current 
ndarray object. To make the change permanent, a new instance needs to be created. 


a = np.array(range(9))   #a is an ndarray of shape (9,)
print(a.shape)
A = a.reshape(3,3)       #A is an ndarray of shape (3,3)
print(a)
print(A)


(9,)
[0 1 2 3 4 5 6 7 8]
[[0 1 2]
 [3 4 5]
 [6 7 8]]


One shape dimension for reshape can be specified as -1. The dimension is then
inferred from the other dimension(s).





The T attribute of an ndarray gives its transpose. Note that the transpose of a “vector”
with shape (n,) is the same vector. To distinguish between column and row vectors, reshape
such a vector to an n × 1 and 1 × n array, respectively.


a = np.arange(3) #1D array (vector) of shape (3,) 
print (a) 

print (a.shape) 

b = a.reshape(-1,1) # 3x1 array (matrix) of shape (3,1) 
print (b) 

print (b.T) 

A = np.arange(9).reshape (3,3) 

print (A.T) 


[0 1 2]
(3,)
[[0]
 [1]
 [2]]
[[0 1 2]]
[[0 3 6]
 [1 4 7]
 [2 5 8]]





Two useful methods of joining arrays are hstack and vstack, where the arrays are
joined horizontally and vertically, respectively.


A = np.ones((3,3))
B = np.zeros((3,2))
C = np.hstack((A,B))
print(C)


[[1. 1. 1. 0. 0.]
 [1. 1. 1. 0. 0.]
 [1. 1. 1. 0. 0.]]


D.10.2 Slicing


Arrays can be sliced similarly to Python lists. If an array has several dimensions, a slice for 
each dimension needs to be specified. Recall that Python indexing starts at 0 and ends
at len(obj)-1. The following program illustrates various slicing operations.


A = np.array(range(9)).reshape (3,3) 

print (A) 

print (A[0]) # first row 

print(A[:,1]) # second column 

print(A[0,1]) # element in first row and second column 
print(A[0:1,1:2]) # (1,1) ndarray containing A[0,1] = 1 
print(A[1:,-1]) # elements in 2nd and 3rd rows, and last column 


[[0 1 2]
 [3 4 5]
 [6 7 8]]
[0 1 2]
[1 4 7]
1
[[1]]
[5 8]





Note that ndarrays are mutable objects, so that elements can be modified directly, without 
having to create a new object. 


A[1:,1] = [0,0] # change two elements in the matrix A above 
print (A) 


[[0 1 2]
 [3 0 5]
 [6 0 8]]





D.10.3 Array Operations 


Basic mathematical operators and functions act element-wise on ndarray objects. 


x = np.array([[2,4],[6,8]])
y = np.array([[1,1],[2,2]])
print(x+y)


[[ 3  5]
 [ 8 10]]


print(np.divide(x,y)) # same as x/y


[[2. 4.]
 [3. 4.]]


print(np.sqrt(x))


[[1.41421356 2. ] 
[2.44948974 2.82842712]] 





In order to compute matrix multiplications and compute inner products of vectors,
numpy’s dot function can be used, either as a method of an ndarray instance or as a
function of np.


print (np.dot(x,y)) 


[[10, 10] 
[22, 22]] 


print (x.dot(x)) # same as np.dot(x,x) 


[[28, 40] 
[60, 88]] 





Since version 3.5 of Python, it is possible to multiply two ndarrays using the @ 
operator (which implements the np.matmul method). For matrices, this is similar to using 
the dot method. For higher-dimensional arrays the two methods behave differently. 


print(x @ y) 


[[10 10] 
[22 22]] 
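
The remark that the @ operator and dot behave differently for higher-dimensional arrays can be checked by comparing the shapes of the results. The following is a minimal sketch with arbitrarily chosen arrays.

```python
import numpy as np

A = np.arange(8).reshape(2, 2, 2)
B = np.arange(8).reshape(2, 2, 2)

print((A @ B).shape)        # (2, 2, 2): one matrix product per leading index
print(np.dot(A, B).shape)   # (2, 2, 2, 2): each matrix in A with each matrix in B

# For 2D arrays the two agree.
x = np.array([[2, 4], [6, 8]])
y = np.array([[1, 1], [2, 2]])
print(np.array_equal(x @ y, np.dot(x, y)))   # True
```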





NumPy allows arithmetic operations on arrays of different shapes (dimensions). Specifically,
suppose two arrays have dimensions $(m_1, m_2, \ldots, m_p)$ and $(n_1, n_2, \ldots, n_p)$,
respectively, where the shape with fewer dimensions is padded on the left with missing
dimensions. The arrays or shapes are said to be aligned if for all $i = 1, \ldots, p$ it holds that


• $m_i = n_i$, or
• $\min\{m_i, n_i\} = 1$, or
• either $m_i$ or $n_i$, or both are missing.


For example, shapes (1, 2, 3) and (4, 2, 1) are aligned, as are (3,) and (1, 2, 3). However,
(2, 2, 2) and (1, 2, 3) are not aligned. NumPy “duplicates” the array elements across the
smaller dimension to match the larger dimension. This process is called broadcasting and
is carried out without actually making copies, thus providing efficient memory use. Below
are some examples.


import numpy as np
A = np.arange(4).reshape(2,2)   # (2,2) array
x1 = np.array([40,500])         # (2,) array
x2 = x1.reshape(2,1)            # (2,1) array
print(A + x1)                   # shapes (2,2) and (2,)
print(A * x2)                   # shapes (2,2) and (2,1)


[[ 40 501]
 [ 42 503]]
[[   0   40]
 [1000 1500]]





Note that above x1 is duplicated row-wise and x2 column-wise. Broadcasting also applies 
to the matrix-wise operator @, as illustrated below. Here, the matrix b is duplicated across 
the third dimension resulting in the two matrix multiplications 


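$$\begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 4 & 5 \\ 6 & 7 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}.$$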


B = np.arange(8).reshape(2,2,2)
b = np.arange(4).reshape(2,2)
print(B@b)


[[[ 2 3] 
[ 6 11]] 


[[10 19] 
[14 27]]] 
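
The alignment rule for broadcasting can also be sketched in plain Python. The helper aligned below is hypothetical (it is not part of numpy) and assumes that shapes are compared starting from the last dimension, which is how NumPy matches them.

```python
def aligned(shape1, shape2):
    """Return True if the two shapes can be broadcast against each other.

    Hypothetical helper written for illustration; it is not part of numpy.
    """
    # Compare dimensions from the last one backwards; zip stops at the
    # shorter shape, so its missing leading dimensions are always accepted.
    for m, n in zip(reversed(shape1), reversed(shape2)):
        if m != n and min(m, n) != 1:
            return False
    return True

print(aligned((1, 2, 3), (4, 2, 1)))   # True
print(aligned((3,), (1, 2, 3)))        # True
print(aligned((2, 2, 2), (1, 2, 3)))   # False
```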





Functions such as sum, mean, and std can also be executed as methods of an ndarray 
instance. The argument axis can be passed to specify along which dimension the function 
is applied. By default axis=None. 


a = np.array(range(4)).reshape(2,2) 
print(a.sum(axis=0)) #summing over rows gives column totals 


[2, 4] 
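
The effect of the axis argument can be seen by comparing a few calls on the same small array. This is a minimal sketch using only the methods named above.

```python
import numpy as np

a = np.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]]

print(a.sum())         # 6: axis=None sums all elements
print(a.sum(axis=0))   # [2 4]: column totals
print(a.sum(axis=1))   # [1 5]: row totals
print(a.mean(axis=1))  # [0.5 2.5]: row means
print(a.std(axis=0))   # [1. 1.]: column standard deviations (divisor n)
```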





D.10.4 Random Numbers 


One of the sub-modules in numpy is random. It contains many functions for random vari- 
able generation. 


import numpy as np

np.random.seed(123)  # set the seed for the random number generator
x = np.random.random()  # uniform (0,1)
y = np.random.randint(5,9)  # discrete uniform 5,...,8
z = np.random.randn(4)  # array of four standard normals
print(x, y, '\n', z)


0.6964691855978616 7 
[ 1.77399501 -0.66475792 -0.07351368 1.81403277] 
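
Recent NumPy versions also offer a Generator-based interface through np.random.default_rng. The sketch below is an alternative to the calls above and is not taken from the text; the generated values differ from those shown above, because a different generator is used.

```python
import numpy as np

rng = np.random.default_rng(123)   # generator with its own seed
x = rng.random()                   # uniform on [0, 1)
y = rng.integers(5, 9)             # discrete uniform on 5,...,8
z = rng.standard_normal(4)         # array of four standard normals

# A second generator with the same seed reproduces the same stream.
rng2 = np.random.default_rng(123)
print(x == rng2.random())          # True
```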





For more information on random variable generation in numpy, see 


https://docs.scipy.org/doc/numpy/reference/random/index.html.


D.11 Matplotlib


The main Python graphics library for 2D and 3D plotting is matplotlib, and its subpackage
pyplot contains a collection of functions that make plotting in Python similar to that
in MATLAB.


D.11.1 Creating a Basic Plot 


The code below illustrates various possibilities for creating plots. The style and color of 
lines and markers can be changed, as well as the font size of the labels. Figure D.1 shows 
the result. 


sqrtplot.py 


import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0, 10, 0.1)
u = np.arange(0,10)
y = np.sqrt(x)
v = u/3
plt.figure(figsize = [4,2])  # size of plot in inches
plt.plot(x,y, 'g--')  # plot green dashed line
plt.plot(u,v,'r.')  # plot red dots
plt.xlabel('x')
plt.ylabel('y')
plt.tight_layout()
plt.savefig('sqrtplot.pdf',format='pdf')  # saving as pdf
plt.show()  # both plots will now be drawn


Figure D.1: A simple plot created using pyplot.


The library matplotlib also allows the creation of subplots. The scatterplot and histogram
in Figure D.2 have been produced using the code below. When creating a histogram there 
are several optional arguments that affect the layout of the graph. The number of bins is 
determined by the parameter bins (the default is 10). Scatterplots also take a number of 
parameters, such as a string c which determines the color of the dots, and alpha which 
affects the transparency of the dots.


histscat.py


import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(1000)
u = np.random.randn(100)
v = np.random.randn(100)
plt.subplot(121)  # first subplot
plt.hist(x,bins=25, facecolor='b')
plt.xlabel('X Variable')
plt.ylabel('Counts')
plt.subplot(122)  # second subplot
plt.scatter(u,v,c='b', alpha=0.5)
plt.show()











Figure D.2: A histogram and scatterplot.
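
The same two panels can also be produced with matplotlib's object-oriented interface, which some readers may find easier to extend. The sketch below is one possible variant, not the code used for the figure; the Agg backend is an assumption made so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
u, v = rng.standard_normal(100), rng.standard_normal(100)

# One figure with two axes side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))
counts, edges, _ = ax1.hist(x, bins=25, facecolor='b')  # histogram panel
ax1.set_xlabel('X Variable')
ax1.set_ylabel('Counts')
ax2.scatter(u, v, c='b', alpha=0.5)                      # scatterplot panel
fig.tight_layout()
```

Here hist returns the bin counts and bin edges, which can be useful when the histogram values are needed for further computation.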


One can also create three-dimensional plots as illustrated below. 





surf3dscat.py 


import matplotlib.pyplot as plt 
import numpy as np 
from mpl_toolkits.mplot3d import Axes3D 


def npdf(x,y):
    # standard bivariate normal pdf; the normalizing constant is 2*pi
    return np.exp(-0.5*(pow(x,2)+pow(y,2)))/(2*np.pi)


x, y = np.random.randn(100), np.random.randn(100)
z = npdf(x,y)


xgrid, ygrid = np.linspace(-3,3,100), np.linspace(-3,3,100)
Xarray, Yarray = np.meshgrid(xgrid, ygrid)
Zarray = npdf(Xarray, Yarray)


fig = plt.figure(figsize=plt.figaspect(0.4))
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(x,y,z, c='g')
ax1.set_xlabel('$x$')
ax1.set_ylabel('$y$')
ax1.set_zlabel('$f(x,y)$')


ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(Xarray, Yarray, Zarray, cmap='viridis',
                 edgecolor='none')
ax2.set_xlabel('$x$')
ax2.set_ylabel('$y$')
ax2.set_zlabel('$f(x,y)$')


plt.show() 








Figure D.3: Three-dimensional scatter- and surface plots. 


D.12 Pandas 


The Python package Pandas (module name pandas) provides various tools and data struc- 
tures for data analytics, including the fundamental DataFrame class. 


For the code in this section we assume that pandas has been imported via 


import pandas as pd. 





D.12.1 Series and DataFrame 


The two main data structures in pandas are Series and DataFrame. A Series object can
be thought of as a combination of a dictionary and a 1-dimensional ndarray. The syntax
for creating a Series object is

series = pd.Series(<data>, index=['index'])

Here, <data> is some 1-dimensional data structure, such as a 1-dimensional ndarray, a list,
or a dictionary, and index is a list of names of the same length as <data>. When <data>
is a dictionary, the index is created from the keys of the dictionary. When <data> is an
ndarray and index is omitted, the default index will be [0, ..., len(data)-1].

DICT = {'one':1, 'two':2, 'three':3, 'four':4}
print(pd.Series(DICT))
one 1 
two 2 
three 3 
four 4 
dtype: int64 
years = ['2000','2001','2002'] 
cost = [2.34, 2.89, 3.01] 
print (pd.Series(cost,index = years, name = 'MySeries')) #name it 
2000 2.34 
2001 2.89 
2002 3.01 
Name: MySeries, dtype: float64 
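
The description of a Series as part dictionary, part ndarray can be illustrated with a few lookups and a vectorized operation. This is a minimal sketch that reuses the values of the example above.

```python
import pandas as pd

s = pd.Series([2.34, 2.89, 3.01], index=['2000', '2001', '2002'],
              name='MySeries')

# Dictionary-like behavior: lookup and membership by label.
print(s['2001'])        # 2.89
print('2002' in s)      # True

# ndarray-like behavior: positional access, vectorized arithmetic, masks.
print(s.iloc[0])                  # 2.34
print((s * 2).tolist())           # [4.68, 5.78, 6.02]
print(s[s > 2.5].index.tolist())  # ['2001', '2002']
```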

The most commonly-used data structure in pandas is the two-dimensional DataFrame,
which can be thought of as pandas’ implementation of a spreadsheet or as a dictionary
in which each “key” of the dictionary corresponds to a column name and the dictionary
“value” is the data in that column. To create a DataFrame one can use the
pandas DataFrame constructor, which has three main arguments: data, index (row labels),
and columns (column labels).

DataFrame(<data>, index=['<row_name>'], columns=['<column_name>'])

If the index is not specified, the default index is [0, ..., len(data)-1]. Data can
also be read directly from a CSV or Excel file, as is done in Section 1.1. If a dictionary is
used to create the data frame (as below), the dictionary keys are used as the column names.


DICT = {'numbers':[1,2,3,4], 'squared':[1,4,9,16] } 
df = pd.DataFrame(DICT, index = list('abcd')) 
print (df) 


   numbers  squared
a        1        1
b        2        4
c        3        9
d        4       16


D.12.2 Manipulating Data Frames


Often data encoded in DataFrame or Series objects need to be extracted, altered, or com- 
bined. Getting, setting, and deleting columns works in a similar manner as for dictionaries. 
The following code illustrates various operations. 


ages = [6,3,5,6,5,8,0,3]
d = {'Gender':['M', 'F']*4, 'Age': ages}
df1 = pd.DataFrame(d)
df1.at[0,'Age'] = 60  # change an element
df1.at[1,'Gender'] = 'Female'  # change another element
df2 = df1.drop('Age', axis=1)  # drop a column
df3 = df2.copy()  # create a separate copy of df2
df3['Age'] = ages  # add the original column
dfcomb = pd.concat([df1,df2,df3],axis=1)  # combine the three dfs
print(dfcomb)


   Gender  Age  Gender  Gender  Age
0       M   60       M       M    6
1  Female    3  Female  Female    3
2       M    5       M       M    5
3       F    6       F       F    6
4       M    5       M       M    5
5       F    8       F       F    8
6       M    0       M       M    0
7       F    3       F       F    3


Note that the above DataFrame object has two Age columns. The expression
dfcomb['Age'] will return a DataFrame with both these columns.
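
The dictionary-like handling of columns, and the effect of duplicate column names, can be sketched as follows. The small data frame and the column name AgeNextYear are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F'] * 2, 'Age': [6, 3, 5, 6]})

df['AgeNextYear'] = df['Age'] + 1   # set a new column, as for a dictionary
print('Age' in df)                  # True: membership tests the column names
ages = df.pop('Age')                # get and delete a column in one step
del df['AgeNextYear']               # delete a column
print(list(df.columns))             # ['Gender']

# Concatenating along axis=1 can produce duplicate column names.
dup = pd.concat([df, df], axis=1)
print(type(dup['Gender']).__name__)  # DataFrame
# Keep only the first occurrence of each column name.
print(dup.loc[:, ~dup.columns.duplicated()].shape)  # (4, 1)
```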


Table D.3: Useful pandas methods for data manipulation. 


agg Aggregate the data using one or more functions. 
apply Apply a function to a column or row. 

astype Change the data type of a variable. 

concat Concatenate data objects. 

replace Find and replace values. 

read_csv Read a CSV file into a DataFrame. 
sort_values Sort by values along rows or columns. 

stack Stack a DataFrame. 

to_excel Write a DataFrame to an Excel file. 


It is important to correctly specify the data type of a variable before embarking on 
data summarization and visualization tasks, as Python may treat different types of objects 
in dissimilar ways. Common data types for entries in a DataFrame object are float, 
category, datetime, bool, and int. A generic object type is object. 


d={'Gender':['M', 'F', 'F']*4, 'Age': [6,3,5,6,5,8,0,3,6,6,7,7]}
df=pd.DataFrame(d)


print(df.dtypes)
df['Gender'] = df['Gender'].astype('category')  # change the type
print(df.dtypes)


Gender    object
Age        int64
dtype: object
Gender    category
Age          int64
dtype: object
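
Type conversions for other common cases follow the same pattern. The sketch below uses a hypothetical data frame in which the ages and dates arrive as strings, as often happens when reading raw files.

```python
import pandas as pd

# Hypothetical raw data: every column is initially stored as strings.
df = pd.DataFrame({'Gender': ['M', 'F', 'F'],
                   'Age': ['6', '3', '5'],
                   'Date': ['2020-01-01', '2020-02-01', '2020-03-01']})

df['Gender'] = df['Gender'].astype('category')  # categorical variable
df['Age'] = df['Age'].astype(int)               # strings to integers
df['Date'] = pd.to_datetime(df['Date'])         # strings to datetime

print(df['Gender'].cat.categories.tolist())  # ['F', 'M']
print(df['Age'].sum())                       # 14
print(df['Date'].dt.month.tolist())          # [1, 2, 3]
```
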
D.12.3 Extracting Information 
Extracting statistical information from a DataFrame object is facilitated by a large
collection of methods (functions) in pandas. Table D.4 gives a selection of data inspection
methods. See Chapter 1 for their practical use. The code below provides several examples
of useful methods. The apply method allows one to apply general functions to columns
or rows of a DataFrame. These operations do not change the data. The loc method allows
for accessing elements (or ranges) in a data frame and acts similarly to the slicing operation
for lists and arrays, with the difference that the “stop” value is included, as illustrated in
the code below.


import numpy as np
import pandas as pd
ages = [6,3,5,6,5,8,0,3]
np.random.seed(123)
df = pd.DataFrame(np.random.randn(3,4), index = list('abc'),
                  columns = list('ABCD'))
print(df)
df1 = df.loc["b":"c","B":"C"]  # create a partial data frame
print(df1)
meanA = df['A'].mean()  # mean of 'A' column
print('mean of column A = {}'.format(meanA))
expA = df['A'].apply(np.exp)  # exp of all elements in 'A' column
print(expA)


          A         B         C         D
a -1.085631  0.997345  0.282978 -1.506295
b -0.578600  1.651437 -2.426679 -0.428913
c  1.265936 -0.866740 -0.678886 -0.094709
          B         C
b  1.651437 -2.426679
c -0.866740 -0.678886
mean of column A = -0.13276486552118785
a    0.337689
b    0.560683
c    3.546412
Name: A, dtype: float64
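
The inclusive "stop" value of loc is easiest to see next to the position-based iloc, which follows the usual Python slicing convention. The sketch below uses a small integer-valued data frame chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'),
                  columns=list('ABCD'))

by_label = df.loc['b':'c', 'B':'C']  # label slices include the stop label
by_pos = df.iloc[1:3, 1:3]           # position slices exclude the stop position
print(by_label.equals(by_pos))       # True
print(by_label.values.tolist())      # [[5, 6], [9, 10]]

# apply passes each column to the function; here the range of each column.
print(df.apply(lambda col: col.max() - col.min()).tolist())  # [8, 8, 8, 8]
```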


The groupby method of a DataFrame object is useful for summarizing and displaying
the data in manipulated ways. It groups data according to one or more specified columns,
such that methods such as count and mean can be applied to the grouped data.


Table D.4: Useful pandas methods for data inspection.


columns        Column names.
count          Counts number of non-NA cells.
crosstab       Cross-tabulate two or more categories.
describe       Summary statistics.
dtypes         Data types for each column.
head           Display the top rows of a DataFrame.
groupby        Group data by column(s).
info           Display information about the DataFrame.
loc            Access a group of rows or columns.
mean           Column/row mean.
plot           Plot of columns.
std            Column/row standard deviation.
sum            Returns column/row sum.
tail           Display the bottom rows of a DataFrame.
value_counts   Counts of different non-null values.
var            Variance.


df = pd.DataFrame({'W':['a','a','b','a','a','b'],
                   'X':np.random.rand(6),
                   'Y':['c','d','d','d','c','c'], 'Z':np.random.rand(6)})


print(df)


   W         X  Y         Z
0  a  0.993329  c  0.641084
1  a  0.925746  d  0.428412
2  b  0.266772  d  0.460665
3  a  0.201974  d  0.261879
4  a  0.529505  c  0.503112
5  b  0.006231  c  0.849683


print(df.groupby(['W', 'Y']).mean()) 


            X         Z
W Y
a c  0.761417  0.572098
  d  0.563860  0.345145
b c  0.006231  0.849683
  d  0.266772  0.460665





To allow for multiple functions to be calculated at once, the agg method can be used.
It can take a list, dictionary, or string of functions.


print(df.groupby('W')[['X','Z']].agg(['sum','mean']))  # numeric columns only


          X                   Z
        sum      mean       sum      mean
W
a  2.650555  0.662639  1.834487  0.458622
b  0.273003  0.136502  1.310348  0.655174
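
When agg is given a dictionary, different functions can be applied to different columns. The sketch below uses fixed, made-up numbers so that the results are easy to verify by hand; the output column names x_total and z_mean are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'W': ['a', 'a', 'b', 'a', 'a', 'b'],
                   'X': [1., 2., 3., 4., 5., 6.],
                   'Y': list('cdddcc'),
                   'Z': [10., 20., 30., 40., 50., 60.]})

# Dictionary form: column name -> function name(s).
res = df.groupby('W').agg({'X': ['sum', 'mean'], 'Z': 'max'})
print(res)

# Named aggregation gives flat, self-describing column names.
named = df.groupby('W').agg(x_total=('X', 'sum'), z_mean=('Z', 'mean'))
print(named)
```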





D.12.4 Plotting 


The plot method of a DataFrame makes plots of a DataFrame using Matplotlib. Different 
types of plot can be accessed via the kind = 'str' construction, where str is one of 
line (default), bar, hist, box, kde, and several more. Finer control, such as modifying 
the font, is obtained by using matplotlib directly. The following code produces the line 
and box plots in Figure D.4. 


import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({'normal':np.random.randn(100),
                   'Uniform':np.random.uniform(0,1,100)})
font = {'family' : 'serif', 'size' : 14}  # set font
matplotlib.rc('font', **font)  # change font
df.plot()  # line plot (default)
df.plot(kind = 'box')  # box plot
plt.show()  # render plots


Figure D.4: A line and box plot using the plot method of DataFrame.


D.13 Scikit-learn 


Scikit-learn is an open-source machine learning and data science library for Python. The
library includes a range of algorithms relating to the chapters in this book. It is widely
used due to its simplicity and its breadth. The module name is sklearn. Below is a brief
introduction to modeling data with sklearn. The full documentation can be found
at


https://scikit-learn.org/.


D.13.1 Partitioning the Data


Randomly partitioning the data in order to test the model may be achieved easily with 
sklearn’s function train_test_split. For example, suppose that the training data is 
described by the matrix X of explanatory variables and the vector y of responses. Then the 
following code splits the data set into training and testing sets, with the testing set being 
half of the total set. 


from sklearn.model_selection import train_test_split 


X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size = 0.5) 
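
Because the split is random, results change from run to run unless a seed is fixed. The sketch below uses the random_state and stratify arguments of train_test_split on a small made-up data set; neither argument is used in the text's own examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 observations, 2 features, balanced binary response.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# random_state makes the split reproducible; stratify keeps the class
# proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)  # (10, 2) (10, 2)
print(y_train.sum(), y_test.sum())  # 5 5
```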





As an example, the following code generates a synthetic data set and splits it into 
equally-sized training and test sets. 


syndat.py 


import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split 


np.random.seed(1234)


X = np.pi*(2*np.random.random(size=(400,2))-1)
y = (np.cos(X[:,0])*np.sin(X[:,1]) >= 0)


X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5)


fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(X_train[y_train==0,0],X_train[y_train==0,1], c='g',
           marker='o',alpha=0.5)
ax.scatter(X_train[y_train==1,0],X_train[y_train==1,1], c='b',
           marker='o',alpha=0.5)
ax.scatter(X_test[y_test==0,0],X_test[y_test==0,1], c='g',
           marker='s',alpha=0.5)
ax.scatter(X_test[y_test==1,0],X_test[y_test==1,1], c='b',
           marker='s',alpha=0.5)


plt.savefig('sklearntraintest.pdf',format='pdf')
plt.show()


D.13.2 Standardization 


In some instances it may be necessary to standardize the data. This may be done in 
sklearn with scaling methods such as MinMaxScaler or StandardScaler. Scaling may 
improve the convergence of gradient-based estimators and is useful when visualizing data 
on vastly different scales. For example, suppose that X is our explanatory data (e.g., stored 
as a numpy array), and we wish to standardize such that each value lies between 0 and 1.


Figure D.5: Example training (circles) and test (squares) set for two class classification.
Explanatory variables are the (x, y) coordinates, classes are zero (green) or one (blue).


from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
x_scaled = min_max_scaler.fit_transform(X)
# equivalent to:
x_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
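
The other scaler mentioned above, StandardScaler, centers each column and divides by its standard deviation. The sketch below checks this against the explicit formula on a small made-up array.

```python
import numpy as np
from sklearn import preprocessing

# Made-up explanatory data with columns on different scales.
X = np.array([[1., 10.], [2., 20.], [3., 60.]])

z = preprocessing.StandardScaler().fit_transform(X)
# Equivalent explicit computation (standard deviation with divisor n).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(z, z_manual))          # True
print(np.allclose(z.mean(axis=0), 0.0))  # True: columns are centered
```

In practice one would typically call fit on the training data only and then apply transform to both the training and the test data, so that no information from the test set enters the scaling.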

D.13.3 Fitting and Prediction 
Once the data has been partitioned and, if necessary, standardized, a statistical model
(e.g., a classification or regression model) may be fitted to the data. For example, continuing with
our data from above, the following fits a model to the data and predicts the responses for
the test set.

from sklearn.someSubpackage import someClassifier 

clf = someClassifier() # choose appropriate classifier 

clf.fit(X_train, y_train) # fit the data 

y_prediction = clf.predict(X_test) # predict 
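
As one concrete instance of this template, the sketch below uses a K-nearest neighbors classifier on synthetic data of the same form as in syndat.py. The choice of classifier, the number of neighbors, and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data on the square [-pi, pi]^2.
rng = np.random.default_rng(1234)
X = np.pi * (2 * rng.random(size=(400, 2)) - 1)
y = (np.cos(X[:, 0]) * np.sin(X[:, 1]) >= 0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # choose a classifier
clf.fit(X_train, y_train)                  # fit to the training data
y_prediction = clf.predict(X_test)         # predict the test responses

accuracy = np.mean(y_prediction == y_test)  # fraction classified correctly
print(accuracy)
```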

Specific classifiers for logistic regression, naive Bayes, linear and quadratic discriminant
analysis, K-nearest neighbors, and support vector machines are given in Section 7.8.


D.13.4 Testing the Model 


Once the model has made its prediction we may test its effectiveness, using relevant met- 
rics. For example, for classification we may wish to produce the confusion matrix for the 
test data. The following code does this for the data shown in Figure D.5, using a support 
vector machine classifier.


from sklearn import svm
clf = svm.SVC(kernel = 'rbf')
clf.fit(X_train, y_train)
y_prediction = clf.predict(X_test)


from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_prediction))


[[102 12] 
[ 1 85]] 
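
Summary metrics can be read off the confusion matrix directly. The sketch below uses the matrix printed above and the sklearn convention that rows correspond to true classes and columns to predicted classes.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns predictions.
C = np.array([[102, 12],
              [1, 85]])

accuracy = np.trace(C) / C.sum()  # (102 + 85) / 200
tn, fp, fn, tp = C.ravel()        # binary case: class one is "positive"
precision = tp / (tp + fp)        # 85 / 97
recall = tp / (tp + fn)           # 85 / 86

print(accuracy)  # 0.935
print(round(precision, 4), round(recall, 4))
```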





D.14 System Calls, URL Access, and Speed-Up 


Operating system commands (whether in Windows, MacOS, or Linux) for creating dir- 
ectories, copying or removing files, or executing programs from the system shell can be 
issued from within Python by using the package os. Another useful package is requests 
which enables direct downloads of files and webpages from URLs. The following Python 
script uses both. It also illustrates a simple example of exception handling in Python. 


misc.py 


import os
import requests
for c in "123456":
    try:                         # if it does not yet exist
        os.mkdir("MyDir" + c)    # make a directory
    except FileExistsError:      # otherwise
        pass                     # do nothing


uname = ("https://github.com/DSML-book/Programs/tree/master/"
         "Appendices/Python Primer/")
fname = "ataleof2cities.txt"
r = requests.get(uname + fname)
print(r.text)
open('MyDir1/ato2c.txt', 'wb').write(r.content)  # write to a file
# bytes mode is important here
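
The directory creation and file writing steps can also be written without exception handling. The sketch below uses os.makedirs with exist_ok=True and a with block; it works in a temporary directory, and the file name note.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:  # scratch location (assumption)
    for c in "123":
        path = os.path.join(base, "MyDir" + c)
        os.makedirs(path, exist_ok=True)     # no error if it already exists
        os.makedirs(path, exist_ok=True)     # so a second call is harmless
    fname = os.path.join(base, "MyDir1", "note.txt")
    with open(fname, "wb") as f:             # the file is closed automatically
        f.write(b"some bytes")
    created = sorted(os.listdir(base))
    size = os.path.getsize(fname)

print(created)  # ['MyDir1', 'MyDir2', 'MyDir3']
print(size)     # 10
```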


The package numba can significantly speed up calculations via smart compilation. First 
run the following code. 





import time
import numpy as np
from numba import jit
n = 10**8


#@jit
def myfun(s,n):
    for i in range(1,n):
        s = s + 1/i
    return s


start = time.perf_counter()  # time.clock() was removed in Python 3.8
print("Euler's constant is approximately {:9.8f}".format(
    myfun(0,n) - np.log(n)))
end = time.perf_counter()
print("elapsed time: {:3.2f} seconds".format(end-start))


Euler's constant is approximately 0.57721566 


elapsed time: 5.72 seconds 





Now remove the # character before the @ character in the code above, in order to 
activate the “just in time” compiler. This gives a 15-fold speedup: 


Euler's constant is approximately 0.57721566 
elapsed time: 0.39 seconds 


Further Reading 


To learn Python, we recommend [82] and [110]. However, as Python is constantly evolving, 
the most up-to-date references will be available from the Internet.


A


acceptance probability, 78—80, 97 

acceptance—rejection method, 73, 78 

accuracy (classification—), 254 

activation function, 204, 325 

AdaBoost, 317—320 

AdaGrad, 339 

Adam method, 339, 346 

adjoint operation, 361 

affine transformation, 405, 435 

agglomerative clustering, 147 

Akaike information criterion, 126, 176, 
177 

algebraic multiplicity, 363 

aligned arrays (Python), 481 

almost sure convergence, 440 

alternating direction method of 
multipliers, 220, 416 

alternative hypothesis, 458 

anaconda (Python), 463 

analysis of variance (ANOVA), 183, 184, 
195, 208 

annealing schedule, 97 

approximation error, 32-34, 184 

approximation—estimation tradeoff, 32, 
41, 323 

Armijo inexact line search, 409 

assignment operator (Python), 467 

attributes (Python), 465 




INDEX 


auxiliary variable methods, 128 
axioms of Kolmogorov, 421 


B 


back-propagation, 331 
backward elimination, 201 
backward substitution, 370 
bagged estimator, 306 
bagging, 305, 307, 310 
balance equations (Markov chains), 78, 
79, 452 

bandwidth, 131, 134, 225 
barplot, 9 
barrier function, 417 
Barzilai-Borwein formulas, 334, 413
basis 

of a vector space, 355 

orthogonal -, 361 
Bayes 

empirical, 241 

error rate, 252 

factor, 57 

naive —, 258 

optimal decision rule, 258 
Bayes’ rule, 47, 48, 428, 454 
Bayesian information criterion, 54 
Bayesian statistics, 47, 49, 454 
Bernoulli distribution, 425, 457 
Bessel distribution, 164, 226
beta distribution, 52, 425

bias of an estimator, 454 

bias vector (deep learning), 326 

bias—variance tradeoff, 35, 305 

binomial distribution, 425 

Boltzmann distribution, 96 

bootstrap aggregation, see bagging 

bootstrap method, 88, 306 

bounded mapping, 389 

boxplot, 10, 14 

broadcasting (Python), 481 

Broyden’s family, 411 

Broyden—Fletcher—Goldfarb—Shanno 
(BFGS) updating, 267, 338, 411 

burn-in period, 78 


C 
categorical variable, 3, 177, 178, 191, 
192, 251, 299 
Cauchy sequence, 245, 385 
Cauchy—Schwarz inequality, 223, 246, 
389, 412 
central difference estimator, 106 
central limit theorem, 447, 458 
multivariate, 448 
centroid, 144 
chain rule for differentiation, 401 
characteristic function, 225—227, 246, 
392, 441 
characteristic polynomial, 363 
Chebyshev’s inequality, 444 
chi-squared distribution, 436, 439 
Cholesky decomposition, 70, 154, 247, 
264, 373 
circulant matrix, 381, 393 
class (Python), 473 
classification, 20, 251-286 
hierarchical, 256 
multilabel, 256 
classifier, 21, 251 
coefficient of determination, 181, 195 
adjusted, 181 
coefficient profiles, 221 
combinatorial optimization, 402 
comma separated values (CSV), 2 
common random numbers, 106, 119 


complete Hilbert space, 224, 384 
complete vector space, 216 
complete convergence, 443 
complete-data 

likelihood, 128 

log-likelihood, 138 
composition of functions, 400 
concave function, 404, 407 
conditional 

distribution, 431 

expectation, 431 

pdf, 74, 430 

probability, 428 
confidence interval, 85, 89, 94, 186, 457 

Bayesian, 51 

bootstrap, 89 
confidence region, 457 
confusion matrix, 253, 254 
constrained optimization, 403 
context management (Python), 477 
continuous mapping, 389 
continuous optimization, 402 
control variable, 92 
convergence 

almost sure, 440 

in L? norm, 442 

in distribution, 440 

in probability, 439 

sure, 439 
convex 

function, 62, 220, 403 

program, 405—408 

set, 42, 403 
convolution, 380, 392 
convolution neural network, 329 
Cook’s distance, 212 
cooling factor, 97 
correlation coefficient, 71, 429 
cost-complexity 

measure, 303 

pruning, 303 
countable sample space, 422 
covariance, 429 

matrix, 45, 70, 430-432, 435, 436 

properties, 429 
coverage probability, 457
credible
interval, 51 
region, 51 
critical 
region, 458 
value, 458 
cross tabulate, 7 
cross-entropy 
method, 100, 110 
risk, 53, 122, 125 
in-sample, 176 
training loss, 123 
cross-validation, 37, 38 
leave-one-out, 40, 173 
linear model, 174 
crude Monte Carlo, 85 
cubic spline, 236 
cumulative distribution function (cdf), 
72, 423 
joint, 427 
cycle, 81 


D 
Davidon—Fletcher—Powell updating, 352, 
412 
decision tree, 288 
deep learning, 330 
degrees of freedom, 181 
dendrogram, 148 
density, 385 
dependent variable, 168 
derivatives 
multidimensional, 398 
partial, 397 
design matrix, 179 
detailed balance equations, 453 
determinant of a matrix, 357 
diagonal matrix, 357 
diagonalizable, 364 
dictionary (Python), 473 
digamma function, 127, 162 
dimension, 355 
direct sum, 217 
directional derivative, 404 
discrete 
distribution, 423 


Fourier transform, 392 
optimization, 402 
probability space, 422 
sample space, 422 
uniform distribution, 425 
discriminant analysis, 259 
distribution 
Bernoulli, 425 
Bessel, 226 
beta, 52, 425 
binomial, 425 
chi-squared, 436, 439 
discrete, 423 
discrete uniform, 425 
exponential, 425 
extreme value, 114 
F, 439 
gamma, 425 
Gaussian, see normal 
geometric, 425 
inverse-gamma, 50, 83 
joint, 427 
multivariate normal, 45, 435 
noncentral χ², 437
normal, 44, 425, 434 
Pareto, 425 
Poisson, 425 
probability, 422, 427 
Student’s t, 439 
uniform, 425 
Weibull, 425 
divisive clustering, 147 
dot notation (Python), 466 
dual optimization problem, 407—408 


E 


early stopping, 49, 250 
efficiency 

of estimators, 454 

of acceptance-rejection, 72 
eigen-decomposition, 364 
eigenvalue, 363 
eigenvector, 363 
elementary event, 422 
elite sample, 100 
empirical
Bayes, 241

cdf, 11, 76 

distribution, 131 
entropy impurity, 292 
epoch (deep learning), 349 
equilikely principle, 422 
ergodic Markov chain, 452 
error of the first and second kind, 459 
estimate, 454 
estimator, 454 

bias of, 454 

control variable, 92 

efficiency of, 454 

unbiased, 454 
Euclidean norm, 360 
evaluation functional, 223, 245 
event, 421 

elementary, 422 

independent, 428 
exact match ratio, 257 
exchangeable variables, 40 
expectation, 426 

conditional, 431 

properties, 429, 431 

vector, 45, 430, 432, 435 
expectation—maximization (EM) 

algorithm, 128, 137, 209 

expected generalization risk, 24 
expected optimism, 36 
explanatory variable, 22, 168 
exponential distribution, 425 
extreme value distribution, 114 


F 
factor, 3, 178 
false negative, 254 
false positive, 254 
fast Fourier transform, 394 
F_β score, 255
F distribution, 183, 197, 424, 439 
feasible region, 403 
feature, 1, 20 
importance, 311 
map, 189, 216, 224, 230, 243, 274 
feed-forward network, 326 
feedback shift register, 69 


finite difference method, 107, 113 
finite-dimensional distributions, 427 
Fisher information matrix, 124 
Fisher’s scoring method, 127 
folds (cross-validation), 38 
forward selection, 200 
forward substitution, 370 
Fourier expansion, 386 
Fourier transform, 391 
discrete, 392 
frequentist statistics, 453 
full rank matrix, 28 
function (Python), 468 
function space, 384 
function, C*, 403 
functional, 389 
functions of random variables, 431 


G 
gamma 

distribution, 425 

function, 424 
Gauss—Markov inequality, 59 
Gauss—Newton search direction, 414 
Gaussian distribution, see normal 

distribution 
Gaussian kernel, 225 
Gaussian kernel density estimate, 131 
Gaussian process, 71, 238 
Gaussian rule of thumb, 134 
generalization risk, 23, 86 
generalized inverse-gamma distribution, 
163 

generalized linear model, 204 
geometric cooling, 97 
geometric distribution, 425 
geometric multiplicity, 363 
Gibbs pdf, 97 
Gibbs sampler, 81, 83, 84 

random, 82 

random order, 82 

reversible, 82 
Gini impurity, 292 
global balance equations, 452 
global minimizer, 402 
gradient, 397, 403
boosting, 316

descent, 412 
Gram matrix, 218, 222, 270 
Gram-Schmidt procedure, 375 


H 


Hamming distance, 142 
Hermite polynomials, 389 
Hermitian matrix, 362, 365 
Hessian matrix, 124, 398, 403, 404 
hidden layer, 325 
hierarchical classification, 256 
Hilbert matrix, 33 
inverse, 60 
Hilbert space, 215, 385 
isomorphism, 246 
hinge loss, 269 
histogram, 10 
Hoeffding’s inequality, 62 
homotopy paths, 221 
hyperparameters, 50, 241 
hypothesis testing, 458 


I 
immutable (Python), 464 
importance sampling, 93—96 
improper prior, 50, 83 
in-sample risk, 35 
incremental effects, 179 
independence 
of event, 428 
of random variables, 429 
independence sampler, 79 
independent and identically distributed 
(iid), 429, 446, 454 
indicator, 11 
indicator feature, 178 
indicator loss, 251 
infinitesimal perturbation analysis, 113 
information matrix equality, 124 
inheritance (Python), 474 
initial distribution (Markov chain), 452 
inner product, 360 
instance (Python), 474 
integration 
Monte Carlo, 86 
interaction, 179, 193 


interior-point method, 418 
interval estimate, see confidence interval 
inverse 
discrete Fourier transform, 393 
Fourier transform, 391 
matrix, 370 
inverse-gamma distribution, 50, 83 
inverse-transform method, 72 
irreducible risk, 32 
iterable (Python), 472 
iterative reweighted least squares, 213, 
349 
iterator (Python), 472 


J 
Jacobi 
matrix of, 409, 433 
Jensen’s inequality, 62 
joint 
cdf, 427 
pdf, 427 
jointly normal, see multivariate normal 
jointly normal distribution, see 
multivariate normal distribution 


K 

Karush—Kuhn—Tucker (KKT) conditions, 
407, 408 

kernel density estimation, 131, 135, 226, 
329 

kernel trick, 231 

Kiefer-Wolfowitz algorithm, 107

K-nearest neighbors method, 268 

Kolmogorov axioms, 421 

Kullback—Leibler divergence, 42, 100, 
128, 350 


L 
Lagrange 
dual program, 407 
function, 406 
method, 406—407 
multiplier, 406 
Lagrangian, 406, 416 
penalty, 416 
Laguerre polynomials, 388 
Lance-Williams update, 149
Laplace’s approximation, 450
lasso (regression), 220 
latent variable methods, see auxiliary 
variable methods 
law of large numbers, 67, 446, 458 
law of total probability, 428 
learner, 22, 168 
learning rate, 334, 409 
least-squares 
iterative reweighted, 213 
nonlinear, 190, 335, 414 
ordinary, 27, 46, 171, 191, 211, 378 
regularized, 172, 235, 376 
leave-one-out cross-validation, 40, 173 
left pseudo-inverse, 360 
left-eigenvector, 365 
Legendre polynomials, 387 
length preserving transformation, 361 
length of a vector, 360 
level set, 103 
Levenberg—Marquardt search direction, 
415 
leverage, 173 
Levinson—Durbin, 71, 382 
likelihood, 42, 48, 123, 456 
complete-data, 128 
log-, 136, 456 
optimization, 137 
ratio, 93 
limited memory BFGS, 336 
limiting pdf, 452 
limiting pdf (Markov chain), 452 
line search, 409 
linear 
discriminant function, 260 
kernel, 224, 271 
mapping, 389 
model, 43, 211 
program, 406 
subspace, 362 
transformation, 356, 431 
linearly independent, 355 
link function, 204 
linkage, 148 
matrix, 150 
list comprehension (Python), 473 


local balance equations, see detailed 
balance equations 
local minimizer, 402 
local/global minimum, 402 
log-likelihood, 456 
log-odds ratio, 266 
logarithmic efficiency, 117 
logistic distribution, 204 
logistic regression, 204 
long-run average reward, 89 
loss function, 20 
loss matrix, 253 


M 
M-estimator, 448 
Manhattan distance, 142 
marginal distribution, 427, 436 
Markov chain, 74, 78, 80, 82, 451 
ergodic, 452 
reversible, 452 
simulation of, 75 
Markov chain Monte Carlo, 78 
Markov property, 74, 451 
Matérn kernel, 226 
matplotlib (Python), 483—485 
matrix, 356 
blockwise inverse, 370 
covariance, 70, 436 
determinant, 357 
diagonal —, 357 
inverse, 357 
of Jacobi, 398, 409, 414, 433 
pseudo-inverse, 360 
sparse, 379 
Toeplitz, 379 
trace, 357 
transpose, 357 
matrix multiplication (Python), 481 
max-cut problem, 151 
maximum a posteriori, 52 
maximum distance, 142 
maximum likelihood estimation, 42, 46, 
100, 127, 136, 137, 456 
mean integrated squared error, 133 
mean squared error, 32, 88, 454 
measure, 385
Mersenne twister, 69
method (Python), 466 
method of moments, 455 
Metropolis—Hastings algorithm, 78, 81 
minibatch, 335 
minimax 
equality, 408 
problem, 408 
minimization, 411 
minimizer, 402 
minimum 
global, 402 
local, 402 
misclassification error, 253 
misclassification impurity, 292 
mixture density, 135 
model, 40 
evidence, 54 
linear, 211 
matrix, 43, 170, 174 
multiple linear regression, 169 
normal linear, 174, 182, 183, 438 
regression, 191 
response surface, 189 
simple linear regression, 187 
modified Bessel function of the second 
kind, 163, 226 
module (Python), 469 
modulo 2 generators, 69 
modulus, 69 
moment 
generating function, 427, 436 
sample-, 455 
momentum method, 340 
Monte Carlo 
integration, 86 
sampling, 68—85 
simulation, 67 
Moore-Penrose pseudo-inverse, 360 
multi-logit, 266 
multi-output linear regression, 213 
nonlinear, 328 
multilabel classification, 256 
multiple linear regression, 169 
multiple-recursive generator, 69 
multiplier 


Lagrange, 406 
multivariate 
central limit theorem, 448 
normal distribution, 44—46, 435 
mutable (Python), 464 


N 


naive Bayes, 258 
namespace (Python), 470 
nested models, 58, 180 
network architecture, 329 
network depth, 327 
network width, 327 
neural networks, 323 
Newton’s method, 127, 205, 213, 336, 
409 
— for root-finding, 409 
quasi —, 336 
Neyman-Pearson approach, 459
noisy optimization, 105 
nominal distribution, 93 
noncentral χ² distribution, 437
norm, 384, 389 
normal distribution, 45, 425, 434, 435 
normal equations, 28 
normal linear model, 46, 174, 182, 183, 
438 
normal matrix, 365 
normal method (bootstrap), 89 
normal model, 44 
Bayesian, 49, 50, 83 
normal updating (cross-entropy), 101 
null hypothesis, 458 
null space, 363 


O 

object (Python), 464 

objective function, 402, 403, 407, 415 

Occam’s razor, 173 

operator, 389 

operator (Python), 465 

optimal decision boundary, 270 

optimization 
combinatorial, 402 
constrained, 403 
continuous, 402 
unconstrained, 403 







ordinary least-squares, 27 
orthogonal 

basis, 361 

complement, 362 

matrix, 361, 382 

polynomial, 388 

projection, 362 

vector, 360 
orthonormal, 361 

basis, 386 

system, 385 
out-of-bag, 307 
overfitting, 23, 35, 141, 172, 216, 237, 
289, 293, 300, 314 

overloading (Python), 468 


P 
p-norm, 220, 408 
P-value, 195, 459 
pandas (Python), 2, 485–490 
Pareto distribution, 425 
Parseval’s formula, 392 
partial derivative, 397 
partition, 428 
peaks function, 233 
Pearson’s height data, 207 
penalty function, 415, 419 

exact, 416 
percentile, 7 
percentile method (bootstrap), 89, 91 
permutation matrix, 368 
Plancherel’s theorem, 392 
PLU decomposition, 368 
pointwise squared bias, 35 
pointwise variance, 35 
Poisson distribution, 425 
polynomial kernel, 230 
polynomial regression model, 26 
positive definite 

matrix, 403 
positive semidefinite 

function, 222 

matrix, 367, 404 
posterior 

pdf, 48 

predictive density, 49 


precision, 255 
predicted residual, 173 
— sum of squares (PRESS), 173 
prediction function, 20 
prediction interval, 186 
predictive mean, 240 
predictor, 168 
primal optimization problem, 407 
principal axes, 154 
principal component analysis (PCA), 
153, 155 
principal components, 154 
prior 
improper, 83 
pdf, 48 
predictive density, 49 
uninformative, 49 
probability 
density function (pdf), 424 
density function (pdf), joint, 427 
distribution, 422, 427 
mass function, 424 
measure, 421 
space, 422 
product rule, 74, 428, 452 
projected subgradient method, 106 
projection matrix, 27, 173, 211, 265, 
362, 438 
projection pursuit, 349 
proposal (MCMC), 78 
pseudo-inverse, 28, 211, 360, 378 
Pythagoras’ theorem, 180, 181, 183, 231, 
361 


Q 


quadratic discriminant function, 260 
quadratic program, 406 

qualitative variable, 3 

quantile, 51, 85 

quantile–quantile plot, 199 
quantitative variable, 3 

quartile, 7 

quasi-Newton method, 336, 411 
quasi-random point set, 233 
quotient rule for differentiation, 160 
R 
radial basis function (rbf) kernel, 225, 
276 
random 
experiment, 421 
number generator, 68 
numbers (Python), 482 
sample 
see iid sample, 454 
variable, 422 
vector, 427, 431 
covariance of, 430 
expectation of, 430 
walk sampler, 80 
range (Python), 472 
rank, 28, 356 
rarity parameter (cross-entropy), 100 
ratio estimator, 89 
read_csv (Python), 2 
recall, 255 
reference (Python), 467 
regional prediction functions, 288 
regression, 20, 167 
function, 21 
line, 169 
model, 191 
simple linear, 181 
regularization, 216, 217 
paths, 221 
regularization parameter, 217 
regularizer, 217 
relative error (estimated), 85 
relative time variance product, 454 
renewal reward process, 89 
representational capacity, 323 
representer of evaluation, 222 
reproducing kernel Hilbert space 
(RKHS), 222 
reproducing property, 222 
resampling, 76, 88 
residual squared error, 171 
residual sum of squares, 171 
residuals, 171, 173 
response surface model, 189 
response variable, 20, 168 
reverse Markov chain, 452 


reversibility, 452 

reversible Gibbs sampler, 82 

ridge regression, 216, 217 
Riemann–Lebesgue lemma, 391 
right pseudo-inverse, 360 

risk, 20, 167 

Robbins–Monro algorithm, 106 
root finding, 408 

R², see coefficient of determination 


S 
saddle point, 403 

problem, 408 
sample 

mean, 7, 85, 455 

median, 7 

quantile, 7 

range, 8 

space, 421 

countable, 422 
discrete, 422 

standard deviation, 8, 455 

variance, 8, 89, 455 
saturation, 332 
Savage–Dickey density ratio, 58 
scale-mixture, 164 
scatterplot, 13 
scikit-learn (Python), 490–493 
score function, 42, 123 

method, 113 
secant condition, 411 
semi-simple matrix, 364 
sequence object (Python), 472 
set (Python), 472 
shear operation, 359 
Sherman–Morrison 

formula, 174, 247, 371 

recursion, 372, 373, 414 
significance level, 459 
simple linear regression, 169, 187 
simulated annealing, 96, 97 
sinc kernel, 226 
singular value, 377, 378 
singular value decomposition, 154, 376 
slack variable, 417 
Slater’s condition, 408 
slice (Python), 3, 464 
smoothing parameter, 100 
softmax function, 267, 328 
source vectors, 143 
sparse matrix, 379 
specificity, 255 
spectral representation, 377 
sphere the data, 264 
splitting for continuous optimization, 
103 
splitting rule, 289 
squared-error loss, 167 
standard basis, 356 
standard deviation, 426 
sample-, 455 
standard error (estimated), 85 
standard normal distribution, 434 
standardization, 435 
stationary point, 403 
statistical (estimation) error, 32, 95 
statistical test 
one-sided —, 458 
two-sided —, 458 
statistics 
Bayesian, 454 
frequentist, 453 
steepest descent, 330, 412 
step-size parameter γ, 314 
stochastic approximation, 106, 335 
stochastic confidence interval, 457 
stochastic counterpart, 107 
stochastic gradient descent, 335, 349 
stochastic process, 427 
strict feasibility, 408 
strong duality, 408 
Student’s t distribution, 183, 424, 439 
multivariate, 162, 164, 226 
studentized residual, 212 
stumps, 319 
subgradient, 404 
subgradient method, 106 
sum rule, 422 
supervised learning, 22 
support vectors, 271 
Sylvester equation, 379 
systematic Gibbs sampler, 81 


T 
tables 
counts, 6 
frequency, 6 
margins, 7 
target distribution, 78 
Taylor’s theorem 
multidimensional, 400 
test 
loss, 24 
sample, 24 
statistic, 458 
theta KDE, 134 
time-homogeneous, 452 
Tobit regression, 209 
Toeplitz matrix, 379 
total sum of squares, 181 
tower property of expectation, 431 
trace of a matrix, 357 
training loss, 23 
training set, 21 
transformation 
of random variables, 431, 433 
rule, 95, 432 
transition 
density, 74, 452 
graph, 75 
transpose of a matrix, 356, 357 
tree branch, 302 
true negative, 254 
true positive, 254 
trust region, 409 
type (Python), 466 
type I and type II errors, 459 


U 

unbiased, 59 

unbiased estimator, 454 
unconstrained optimization, 403 
uniform distribution, 425 

union bound, 422 

unitary matrix, 362 

universal approximation property, 226 
unsupervised learning, 22 


V 
validation set, 25, 303 
Vandermonde matrix, 29, 393 
Vapnik–Chervonenkis bound, 62 
variance, 426, 430 

properties, 429 

sample, 89, 455 

sample-, 455 
vector quantization, 143 
vector space, 355 

basis, 355 

dimension, 355 


Voronoi tessellation, 143 


W 

weak derivative, 113 

weak duality, 407 

weak learners, 313 

Weibull distribution, 425 

weight matrix (deep learning), 326 
Wolfe dual program, 407 

Woodbury identity, 248, 351, 371, 399 


